Methods and compositions for targeted nucleic acid sequencing

ABSTRACT

The present invention is directed to methods, compositions and systems for capturing and analyzing sequence information contained in targeted regions of a genome. Such targeted regions may include exomes, partial exomes, introns, combinations of exonic and intronic regions, genes, panels of genes, and any other subsets of a whole genome that may be of interest.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 14/927,297, filed Oct. 29, 2015, which claims the benefit of U.S. Provisional Application No. 62/072,164, filed Oct. 29, 2014, each of which is expressly incorporated herein by reference in its entirety for all purposes.

BACKGROUND OF THE INVENTION

The ability to sequence genomes accurately and rapidly is revolutionizing biology and medicine. The study of complex genomes, and in particular, the search for the genetic basis of disease in humans, involves genetic analysis on a massive scale. Such genetic analysis on a whole genome level is costly not only monetarily but also in time and labor. These costs increase with protocols involving analyses of separate individual DNA samples. Sequencing (and re-sequencing) of polymorphic areas in the genome that are linked to disease development will contribute greatly to the understanding of diseases, such as cancer, and therapeutic development and will help meet the pharmacogenomics challenge to identify the genes and functional polymorphisms associated with the variability in drug response. Screens for numerous genetic markers performed for populations large enough to yield statistically significant data are needed before associations can be made between a given genotype and a particular disease.

One way to reduce the costs associated with genome sequencing while retaining the benefits of genomic analysis on a large scale is to perform high throughput, high accuracy sequencing on targeted regions of the genome. A widely used approach captures much of the entire protein coding region of a genome (the exome), which makes up about 1% of the human genome, and has become a routine technique in clinical and basic research. Exome sequencing offers advantages over whole genome sequencing: it is significantly less expensive, is more easily understood for functional interpretation, is significantly faster to analyze, makes very deep sequencing affordable, and results in a dataset that is easier to manage. A need exists for methods, systems and compositions for the enrichment of target regions of interest for high accuracy and high throughput sequencing and genetic analysis.

SUMMARY OF THE INVENTION

Accordingly, the present invention provides methods, systems and compositions for obtaining sequence information for targeted regions of the genome.

In some aspects, the present disclosure provides a method for sequencing one or more selected portions of a genome, the method generally including the steps of: (a) providing starting genomic material; (b) distributing individual nucleic acid molecules from the starting genomic material into discrete partitions such that each discrete partition contains a first individual nucleic acid molecule; (c) fragmenting the individual nucleic acid molecules in the discrete partitions to form a plurality of fragments, where each of the fragments further includes a barcode, and where fragments within a given discrete partition each include a common barcode, thereby associating each fragment with the individual nucleic acid molecule from which it is derived; (d) providing a population enriched for fragments including at least a portion of the one or more selected portions of the genome; and (e) obtaining sequence information from the population, thereby sequencing one or more selected portions of a genome.

In further embodiments and in accordance with the above, providing the population enriched for fragments including at least a portion of the one or more selected portions of the genome includes the steps of (i) hybridizing probes complementary to regions in or near the one or more selected portions of the genome to the fragments to form probe-fragment complexes; and (ii) capturing probe-fragment complexes to a surface of a solid support; thereby enriching the population with fragments including at least a portion of the one or more selected portions of the genome. In yet further embodiments, the solid support includes a bead. In still further embodiments, the probes include binding moieties and the surface includes capture moieties, and the probe-fragment complexes are captured on the surface through a reaction between the binding moieties and the capture moieties. In further examples, the capture moieties include streptavidin and the binding moieties include biotin. In still further examples, the capture moieties comprise streptavidin magnetic beads and the binding moieties comprise biotinylated RNA library baits.

In some embodiments and in accordance with any of the above, the methods of the invention include the use of capture moieties that are directed to whole or partial exome capture, panel capture, targeted exon capture, anchored exome capture, or tiled genomic region capture.

In yet further embodiments and in accordance with any of the above, the methods disclosed herein include an obtaining step that includes a sequencing reaction. In further embodiments, the sequencing reaction is a short read-length sequencing reaction or a long read-length sequencing reaction. In still further examples, the sequencing reaction provides sequence information on less than 90%, less than 75%, or less than 50% of the starting genomic material.

In still further embodiments, the methods described herein further include linking two or more of the individual nucleic acid molecules in an inferred contig based upon overlapping sequences of the isolated fragments, wherein the inferred contig comprises a length N50 of at least 10 kb, 20 kb, 40 kb, 50 kb, 100 kb, or 200 kb.

In yet further examples and in accordance with any of the above, the methods disclosed herein further include linking two or more of the individual nucleic acid molecules in a phase block based upon overlapping phased variants within the sequences of the isolated fragments, where the phase block comprises a length N50 of at least 10 kb, of at least 20 kb, of at least 40 kb, of at least 50 kb, of at least 100 kb or of at least 200 kb.
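
The N50 thresholds recited above can be checked with a short calculation. The following Python sketch is illustrative only: the specification does not define N50, so the conventional definition is assumed (the length L such that contigs or phase blocks of length L or longer account for at least half of the summed length), and the function name `n50` and the example lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of lengths, in bases.

    Assumes the conventional definition: the largest length L such that
    items of length >= L together cover at least half of the total.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical phase block lengths, in base pairs.
block_lengths_bp = [250_000, 120_000, 60_000, 40_000, 30_000]
block_n50 = n50(block_lengths_bp)
meets_200_kb = block_n50 >= 200_000
```

Under that assumed definition, a set of phase blocks satisfies the "at least 200 kb" embodiment when `n50` returns a value of 200,000 or more.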

In still further embodiments and in accordance with any of the above, the methods disclosed herein provide sequence information from selected portions of the genome that together cover an exome. In yet further embodiments, the individual nucleic acid molecules in the discrete partitions include genomic DNA from a single cell. In still further embodiments, the discrete partitions each include genomic DNA from a different chromosome.

In further aspects, the present disclosure provides a method of obtaining sequence information from one or more targeted portions of a genomic sample. Such a method includes without limitation the steps of: (a) providing individual first nucleic acid fragment molecules of the genomic sample in discrete partitions; (b) fragmenting the individual first nucleic acid fragment molecules within the discrete partitions to create a plurality of second fragments from each of the individual first nucleic acid fragment molecules; (c) attaching a common barcode sequence to the plurality of the second fragments within a discrete partition, such that each of the plurality of second fragments is attributable to the discrete partition in which it is contained; (d) applying a library of probes directed to the one or more targeted portions of the genomic sample to the second fragments; and (e) conducting a sequencing reaction to identify sequences of the plurality of second fragments that hybridized to the library of probes, thereby obtaining sequence information from the one or more targeted portions of the genomic sample. In further embodiments, the library of probes is attached to binding moieties, and before the conducting step (e), the second fragments are captured on a surface comprising capture moieties through a reaction between the binding moieties and the capture moieties. In still further embodiments and prior to the conducting step (e), the second fragments are amplified before or after the second fragments are captured on the surface. In yet further embodiments, the binding moieties comprise biotin and the capture moieties comprise streptavidin. In still further embodiments, the sequencing reaction is a short read, high accuracy sequencing reaction. In still further embodiments, the second fragments are amplified such that the resultant amplification products are capable of forming partial or complete hairpin structures.

In further aspects and in accordance with any of the above, the present disclosure provides methods for obtaining sequence information from one or more targeted portions of a genomic sample while retaining molecular context. Such methods include the steps of: (a) providing starting genomic material; (b) distributing individual nucleic acid molecules from the starting genomic material into discrete partitions such that each discrete partition contains a first individual nucleic acid molecule; (c) fragmenting the first individual nucleic acid molecules in the discrete partitions to form a plurality of fragments; (d) providing a population enriched for fragments that include at least a portion of the one or more selected portions of the genome; and (e) obtaining sequence information from the population, thereby sequencing one or more targeted portions of the genomic sample while retaining molecular context. In further embodiments, prior to the obtaining step (e), the plurality of fragments are tagged with a barcode to associate each fragment with the discrete partition in which it was formed. In still further embodiments, the individual nucleic acid molecules in step (b) are distributed such that molecular context of each first individual nucleic acid molecule is maintained.

In some aspects, the present disclosure provides methods of obtaining sequence information from one or more targeted portions of a genomic sample. Such methods include without limitation steps of (a) providing individual nucleic acid molecules of the genomic sample; (b) fragmenting the individual nucleic acid molecules to form a plurality of fragments, where each of the fragments further includes a barcode, and where fragments from the same individual nucleic acid molecule have a common barcode, thereby associating each fragment with the individual nucleic acid molecule from which it is derived; (c) enriching the plurality of fragments for fragments containing the one or more targeted portions of the genomic sample; and (d) conducting a sequencing reaction to identify sequences of the enriched plurality of fragments, thereby obtaining sequence information from the one or more targeted portions of the genomic sample. In further embodiments, the enriching step includes applying a library of probes directed to the one or more targeted portions of the genomic sample. In yet further embodiments, the library of probes is attached to binding moieties, and prior to the conducting step, the fragments are captured through a reaction between the binding moieties and capture moieties. In exemplary embodiments, the reaction between the binding moieties and the capture moieties immobilizes the fragments on a surface.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 provides a schematic illustration of identification and analysis of targeted genomic regions using conventional processes versus the processes and systems described herein.

FIG. 2A and FIG. 2B provide schematic illustrations of identification and analysis of targeted genomic regions using processes and systems described herein.

FIG. 3 illustrates a typical workflow for performing an assay to detect sequence information, using the methods and compositions disclosed herein.

FIG. 4 provides a schematic illustration of a process for combining a nucleic acid sample with beads and partitioning the nucleic acids and beads into discrete droplets.

FIG. 5 provides a schematic illustration of a process for barcoding and amplification of chromosomal nucleic acid fragments.

FIG. 6 provides a schematic illustration of the use of barcoding of chromosomal nucleic acid fragments in attributing sequence data to individual chromosomes.

FIG. 7 illustrates a general embodiment of a method of the invention.

FIG. 8 illustrates a general embodiment of a method of the invention.

DETAILED DESCRIPTION OF THE INVENTION

The practice of the present invention may employ, unless otherwise indicated, conventional techniques and descriptions of organic chemistry, polymer technology, molecular biology (including recombinant techniques), cell biology, biochemistry, and immunology, which are within the skill of the art. Such conventional techniques include polymer array synthesis, hybridization, ligation, phage display, and detection of hybridization using a label. Specific illustrations of suitable techniques can be had by reference to the example herein below. However, other equivalent conventional procedures can, of course, also be used. Such conventional techniques and descriptions can be found in standard laboratory manuals such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), Stryer, L. (1995) Biochemistry (4th Ed.) Freeman, N.Y., Gait, “Oligonucleotide Synthesis: A Practical Approach” 1984, IRL Press, London, Nelson and Cox (2000), Lehninger, Principles of Biochemistry 3rd Ed., W. H. Freeman Pub., New York, N.Y. and Berg et al. (2002) Biochemistry 5th Ed. W. H. Freeman Pub., New York, N.Y., all of which are herein incorporated in their entirety by reference for all purposes.

Note that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a polymerase” refers to one agent or mixtures of such agents, and reference to “the method” includes reference to equivalent steps and methods known to those skilled in the art, and so forth.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. All publications mentioned herein are incorporated herein by reference for the purpose of describing and disclosing devices, compositions, formulations and methodologies which are described in the publication and which might be used in connection with the presently described invention.

Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges, and such smaller ranges are also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, features and procedures well known to those skilled in the art have not been described in order to avoid obscuring the invention.

As used herein, the term “comprising” is intended to mean that the compositions and methods include the recited elements, but not excluding others. “Consisting essentially of” when used to define compositions and methods, shall mean excluding other elements of any essential significance to the composition or method. “Consisting of” shall mean excluding more than trace elements of other ingredients for claimed compositions and substantial method steps. Embodiments defined by each of these transition terms are within the scope of this invention. Accordingly, it is intended that the methods and compositions can include additional steps and components (comprising) or alternatively including steps and compositions of no significance (consisting essentially of) or alternatively, intending only the stated method steps or compositions (consisting of).

All numerical designations, e.g., pH, temperature, time, concentration, and molecular weight, including ranges, are approximations which are varied (+) or (−) by increments of 0.1. It is to be understood, although not always explicitly stated, that all numerical designations are preceded by the term “about”. The term “about” also includes the exact value “X” in addition to minor increments of “X” such as “X+0.1” or “X−0.1.” It also is to be understood, although not always explicitly stated, that the reagents described herein are merely exemplary and that equivalents of such are known in the art.

1. Overview

This disclosure provides methods, compositions and systems useful for characterization of genetic material. In particular, the methods, compositions and systems described herein provide genetic characterization of targeted regions of a genome, including without limitation particular chromosomes, regions of chromosomes, all exons (exomes), portions of exomes, specific genes, panels of genes (e.g., kinomes or other targeted gene panels), intronic regions, tiled portions of a genome, or any other chosen portion of a genome.

In general, the methods and systems described herein accomplish targeted genomic sequencing by providing for the determination of the sequence of long individual nucleic acid molecules and/or the identification of direct molecular linkage as between two sequence segments separated by long stretches of sequence, which permit the identification and use of long range sequence information. This sequencing information is nevertheless obtained using methods that have the advantages of the extremely low sequencing error rates and high throughput of short read sequencing technologies. The methods and systems described herein segment long nucleic acid molecules into smaller fragments that can be sequenced using high-throughput, higher accuracy short-read sequencing technologies, and that segmentation is accomplished in a manner that allows the sequence information derived from the smaller fragments to retain the original long range molecular sequence context, i.e., allowing the attribution of shorter sequence reads to originating longer individual nucleic acid molecules. By attributing sequence reads to an originating longer nucleic acid molecule, one can gain significant characterization information for that longer nucleic acid sequence that one cannot generally obtain from short sequence reads alone. This long range molecular context is not only preserved through a sequencing process, but is also preserved through the targeted enrichment process used in targeted sequencing approaches described herein, where no other sequencing approach has shown this ability.

In general, sequence information from smaller fragments will retain the original long range molecular sequence context through the use of a tagging procedure, including the addition of barcodes as described herein and known in the art. In specific examples, fragments originating from the same original longer individual nucleic acid molecule will be tagged with a common barcode, such that any later sequence reads from those fragments can be attributed to that originating longer individual nucleic acid molecule. Such barcodes can be added using any method known in the art, including addition of barcode sequences during amplification methods that amplify segments of the individual nucleic acid molecules as well as insertion of barcodes into the original individual nucleic acid molecules using transposons, including methods such as those described in Amini et al., Nature Genetics 46: 1343-1349 (2014) (advance online publication on Oct. 29, 2014), which is hereby incorporated by reference in its entirety for all purposes and in particular for all teachings related to adding adaptor and other oligonucleotides using transposons. Once nucleic acids have been tagged using such methods, the resultant tagged fragments can be enriched using methods described herein such that the population of fragments represents targeted regions of the genome. As such, sequence reads from that population allow for targeted sequencing of select regions of the genome, and those sequence reads can also be attributed to the originating nucleic acid molecules, thus preserving the original long range molecular sequence context. The sequence reads can be obtained using any sequencing methods and platforms known in the art and described herein.
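
The attribution of short reads to an originating molecule by way of a common barcode can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: it assumes each read has already been parsed into a (barcode, sequence) pair, and the function name `group_reads_by_barcode` and the example reads are hypothetical.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Group read sequences by barcode.

    `reads` is an iterable of (barcode, sequence) pairs. Reads that share a
    barcode are treated as derived from the same partition, and hence from
    the same originating long molecule (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Hypothetical short reads carrying barcodes "1" and "2".
example_reads = [
    ("1", "ACGTTGCA"),
    ("2", "GGATCCAA"),
    ("1", "TTGACCGT"),
]
reads_by_molecule = group_reads_by_barcode(example_reads)
```

Because the grouping key is the barcode alone, the grouping is unaffected by whether an enrichment step has removed other fragments from the pool in the meantime.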

In addition to providing the ability to obtain sequence information from targeted regions of the genome, the methods and systems described herein can also provide other characterizations of genomic material, including without limitation haplotype phasing, identification of structural variations, and identifying copy number variations, as described in co-pending applications U.S. Ser. Nos. 14/752,589 and 14/752,602 (both filed on Jun. 26, 2015), which are herein incorporated by reference in their entirety for all purposes and in particular for all written description, figures and working examples directed to characterization of genomic material.

Methods of processing and sequencing nucleic acids in accordance with the methods and systems described in the present application are also described in further detail in U.S. Ser. Nos. 14/316,383; 14/316,398; 14/316,416; 14/316,431; 14/316,447; and 14/316,463 which are herein incorporated by reference in their entirety for all purposes and in particular for all written description, figures and working examples directed to processing nucleic acids and sequencing and other characterizations of genomic material.

In general, as shown in FIG. 1, the methods and systems described herein may be used to characterize nucleic acids. In particular, as shown, two discrete individual nucleic acids 102 and 104 are illustrated, each having a number of regions of interest, e.g., regions 106 and 108 in nucleic acid 102, and regions 110 and 112 in nucleic acid 104. The regions of interest in each nucleic acid are linked within the same nucleic acid molecule, but may be relatively separated from each other, e.g., more than 1 kb apart, more than 5 kb apart, more than 10 kb apart, more than 20 kb apart, more than 30 kb apart, more than 40 kb apart, more than 50 kb apart, and in some cases, as much as 100 kb apart. The regions may denote individual genes, gene groups, exons, or simply discrete and separate parts of the genome. Solely for ease of discussion, the regions shown in FIG. 1 will be referred to as exons 106, 108, 110 and 112. As shown, each nucleic acid 102 and 104 is separated into its own partition 114 and 116, respectively. As noted elsewhere herein, these partitions are, in many cases, aqueous droplets in a water in oil emulsion. Within each droplet, portions of each fragment are copied in a manner that preserves the original molecular context of those fragments, e.g., as having originated from the same molecule. As shown, this is achieved through the inclusion in each copied fragment of a barcode sequence, e.g., barcode sequence “1” or “2” as illustrated, that is representative of the droplet into which the originating fragment was partitioned. For whole genome sequence analysis applications, one could simply pool all of the copied fragments and their associated barcodes, in order to sequence and reassemble the full range sequence information from each of the originating nucleic acids 102 and 104. However, in many cases, it is more desirable to only analyze specific targeted portions of the overall genome, e.g., the exome, specific genes, or the like, in order to provide greater focus on scientifically relevant portions of the genome, and to minimize the time and expense of performing sequencing on less relevant or irrelevant portions of the genome.

In accordance with the methods described herein, target enrichment steps may be applied to the libraries of barcoded sequence fragments in order to “pull down” the sequences associated with the desired targets. These may include exon targeted pull downs, gene panel specific targeted pull downs, or the like. A large number of targeted pull down kits that allow for the enriched separation of specific targeted regions of the genome are commercially available, such as the Agilent SureSelect exome pull down kits, and the like. As shown in FIG. 1, application of a targeted enrichment results in enriched, barcoded sequence library 118. Further, because the pulled down fragments within library 118 retain their original molecular context, e.g., through the retention of the barcode information, they may be reassembled into their original molecular contexts with embedded long range linkage information, e.g., with inferred linkage as between each of the assembled regions of interest 106:108 and 110:112. By way of example, one may identify direct molecular linkage between two disparate targeted portions of the genome, e.g., two or more exons, and that direct molecular linkage may be used to identify structural variations and other genomic characteristics, as well as to identify the phase information as to the two or more exons, e.g., providing phased exons, including potentially an entire phased exome, or other phased targeted portions of a genome.
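
The inferred linkage between regions of interest, such as 106:108 and 110:112 in FIG. 1, can be illustrated with a small Python sketch. The sketch assumes that each enriched read has already been assigned to a region of interest by alignment, which is not shown; the function name `linked_region_pairs`, the region labels, and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def linked_region_pairs(read_hits):
    """Count, for each pair of regions, the barcodes observed in both.

    `read_hits` is an iterable of (barcode, region) pairs. Two regions that
    share a barcode are inferred to lie on the same originating molecule.
    """
    regions_by_barcode = defaultdict(set)
    for barcode, region in read_hits:
        regions_by_barcode[barcode].add(region)

    support = defaultdict(int)
    for regions in regions_by_barcode.values():
        for pair in combinations(sorted(regions), 2):
            support[pair] += 1
    return dict(support)


# Hypothetical hits mirroring FIG. 1: barcode "1" marks the partition that
# held nucleic acid 102, barcode "2" the partition that held nucleic acid 104.
hits = [
    ("1", "exon106"),
    ("1", "exon108"),
    ("2", "exon110"),
    ("2", "exon112"),
    ("1", "exon106"),
]
linkage = linked_region_pairs(hits)
```

In practice many partitions contribute to each pair, so the per-pair barcode count serves as a measure of support for the inferred linkage; any threshold applied to that count would be a design choice outside this sketch.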

Generally, methods of the invention include steps as illustrated in FIG. 7, which provides a schematic overview of methods of the invention discussed in further detail herein. As will be appreciated, the method outlined in FIG. 7 is an exemplary embodiment that may be altered or modified as needed and as described herein.

As shown in FIG. 7, the methods described herein will in most examples include a step in which sample nucleic acids containing the targeted regions of interest are partitioned (701). Generally, each partition will include a single individual nucleic acid molecule from a particular locus that is then fragmented or copied in such a way as to preserve the original molecular context of the fragments (702), usually by attaching to the fragments barcodes that are specific to the partition in which they are contained. Each partition may in some examples include more than one nucleic acid, and will in some instances contain several hundred nucleic acid molecules. In situations in which multiple nucleic acids are within a partition, any particular locus of the genome will generally be represented by a single individual nucleic acid prior to barcoding. The barcoded fragments of step 702 can be generated using any methods known in the art. In some examples, oligonucleotides are included with the samples within the distinct partitions. Such oligonucleotides may comprise random sequences intended to randomly prime numerous different regions of the samples, or they may comprise a specific primer sequence targeted to prime upstream of a targeted region of the sample. In further examples, these oligonucleotides also contain a barcode sequence, such that the replication process also barcodes the resultant replicated fragment of the original sample nucleic acid. A particularly elegant process for use of these barcode oligonucleotides in amplifying and barcoding samples is described in detail in U.S. patent application Ser. Nos. 14/316,383, 14/316,398, 14/316,416, 14/316,431, 14/316,447, 14/316,463, all filed Jun. 26, 2014, each of which is herein incorporated by reference in its entirety for all purposes. Extension reaction reagents, e.g., DNA polymerase, nucleoside triphosphates, co-factors (e.g., Mg²⁺ or Mn²⁺ etc.), that are also contained in the partitions, then extend the primer sequence using the sample as a template, to produce a complementary fragment to the strand of the template to which the primer annealed, and the complementary fragment includes the oligonucleotide and its associated barcode sequence. Annealing and extension of multiple primers to different portions of the sample can result in a large pool of overlapping complementary fragments of the sample, each possessing its own barcode sequence indicative of the partition in which it was created. In some cases, these complementary fragments may themselves be used as a template primed by the oligonucleotides present in the partition to produce a complement of the complement that again includes the barcode sequence. In further examples, this replication process is configured such that when the first complement is duplicated, it produces two complementary sequences at or near its termini to allow the formation of a hairpin structure or partial hairpin structure, which reduces the ability of the molecule to be the basis for producing further iterative copies.

Returning to the method exemplified in FIG. 7, once the partition-specific barcodes are attached to the copied fragments, the barcoded fragments are then pooled (703). Target enrichment techniques can then be applied (704) to “pull down” the targeted regions of interest. Those targeted regions of interest are then sequenced (705) and the sequences of the fragments are attributed to their originating molecular context (706), such that the targeted regions of interest are both identified and also linked with that originating molecular context. A unique feature of the methods and systems described herein and illustrated in FIG. 7 is that barcodes are attached to the fragments (702) prior to the targeted enrichment step (704). An advantage of the methods and systems described herein is that attaching a partition- or sample-specific barcode to the copied fragments prior to enriching the fragments for targeted genomic regions preserves the original molecular context of those targeted regions, allowing them to be attributed to their original partition and thus their originating sample nucleic acid.
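
The ordering of steps in FIG. 7, with barcoding (702) preceding pooling (703) and enrichment (704), can be shown with a toy Python model. Everything in this sketch is an assumption made for illustration: molecules are short strings, fragmentation is a fixed-size cut rather than primer extension, enrichment is an exact substring match, and the names `barcode_fragments`, `enrich`, `BC1` and `BC2` are hypothetical.

```python
def barcode_fragments(molecule, barcode, size):
    """Step 702 (toy): cut a molecule into fixed-size pieces and tag each
    piece with the barcode of the partition it came from."""
    return [(barcode, molecule[i:i + size]) for i in range(0, len(molecule), size)]


def enrich(pool, targets):
    """Step 704 (toy): keep barcoded fragments containing a target sequence."""
    return [(bc, frag) for bc, frag in pool if any(t in frag for t in targets)]


# Step 701 (toy): one molecule per partition, keyed by partition barcode.
partitions = {"BC1": "AAAACGTACCCC", "BC2": "GGGGTTTTACGT"}

# Step 703: pool the barcoded fragments from all partitions.
pool = []
for barcode, molecule in partitions.items():
    pool.extend(barcode_fragments(molecule, barcode, 4))

# Step 704: enrichment acts on the pooled fragments; barcodes travel with them.
enriched = enrich(pool, targets=["ACGT", "CGTA"])

# Steps 705-706 (toy): each surviving fragment still names its partition.
origins = {frag: bc for bc, frag in enriched}
```

Had the fragments been pooled and enriched before barcoding, the two surviving fragments would carry no record of which molecule each came from, which is the point of the ordering described above.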

In general, targeted genomic regions are enriched, isolated or separated, i.e., “pulled down,” for further analysis, particularly sequencing, using methods that include both chip-based and solution-based capture methods. Such methods utilize probes that are complementary to the genomic regions of interest or to regions near or adjacent to the genomic regions of interest. For example, in hybrid (or chip-based) capture, microarrays containing capture probes (usually single-stranded oligonucleotides) with sequences that taken together cover the region of interest are fixed to a surface. Genomic DNA is fragmented and may further undergo processing such as end-repair to produce blunt ends and/or addition of additional features such as universal priming sequences. These fragments are hybridized to the probes on the microarray. Unhybridized fragments are washed away and the desired fragments are eluted or otherwise processed on the surface for sequencing or other analysis, and thus the population of fragments remaining on the surface is enriched for fragments containing the targeted regions of interest (e.g., the regions comprising the sequences complementary to those contained in the capture probes). The enriched population of fragments may further be amplified using any amplification technologies known in the art.

Additional methods of targeted genomic region capture include solution-based methods, in which genomic DNA fragments are hybridized to oligonucleotide probes. The oligonucleotide probes are often referred to as “baits”. These baits are generally attached to a capture molecule, including without limitation a biotin molecule. The baits are complementary to targeted regions of the genome (or to regions near or adjacent to the targeted regions of interest), such that upon application to genomic DNA fragments, the baits hybridize to the fragments, and the capture molecule (e.g., biotin) is then used to selectively pull down the targeted regions of interest (for example, with magnetic streptavidin beads) to thereby enrich the resultant population of fragments with those containing the targeted regions of interest.
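
The selection performed by bait hybridization can be approximated in silico by a complementarity test. The Python sketch below is a deliberate simplification offered for illustration: it treats a fragment as captured only when it contains the exact reverse complement of a bait, whereas real hybridization tolerates mismatches and depends on temperature and salt conditions. The function names and sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def capture(fragments, baits):
    """Return the fragments that contain the exact reverse complement of at
    least one bait (a stand-in for hybridization followed by pull down)."""
    captured = []
    for fragment in fragments:
        if any(reverse_complement(bait) in fragment for bait in baits):
            captured.append(fragment)
    return captured


# Hypothetical single-stranded fragments and a single six-base bait.
fragments = ["TTGACCGTAAGG", "CCCCCCCCCCCC", "AATTGACCGTTT"]
baits = ["ACGGTC"]  # hybridizes to GACCGT on the fragment strand
pulled_down = capture(fragments, baits)
```

Fragments that are not returned by `capture` correspond to the unbound material that is washed away; the returned list corresponds to the enriched population.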

In examples in which targeted regions covering the whole exome are needed, a library of baits that together cover the whole exome is used to capture those targeted sequences. In such examples, capture protocols can include any of those known in the art, including without limitation any of the exome capture protocols and kits produced by Roche/NimbleGen, Illumina, and Agilent.

Capture of targeted genomic regions for use in the methods and systems described herein is not limited to whole exomes, and can include any one or combination of partial exomes, genes, panels of genes, introns, and combinations of introns and exons. The procedure for capture of these different types of targeted regions follows the general method of using baits to pull down fragments containing the targeted regions of interest. The design of the baits, particularly the oligonucleotide probe portions of the baits that hybridize to or near to the targeted regions of interest, will in part depend on the type of targeted region to be captured.

In examples in which only a partial exome is needed for further analysis, the baits can be designed to capture that part of the exome. In certain examples, the specific identities of the portions of the exome that are needed are known, and the library of baits comprises oligonucleotides that are complementary to those identified portions or to regions that are near or adjacent to those portions. Such examples can further include without limitation capture of specific genes and/or panels of genes, or identified portions of the exome known to be associated with a particular phenotype, such as a disorder or disease. In some examples, it may be that a certain portion of the exome or the whole genome (including both intronic and exonic regions) is needed for further analysis, but the specific sequences for the portions of the genome to be captured are not known. In such embodiments, the baits used can be subsets of a library directed to a whole genome, and that subset can be chosen randomly or through any kind of intelligent design in which the library of baits is selected or enriched for probes that are complementary to the targeted subsections of the genome or exome.

For any of the methods described herein, the targeted regions can be captured using baits that comprise oligonucleotide probes that are complementary to the whole or part of a targeted region, or the oligonucleotide probes may be complementary to another region, e.g., an intronic region, that is near the targeted region or adjacent to the targeted region. For example, as schematically illustrated in FIG. 2A, a genomic sequence 201 comprises exonic regions 202 and 203. Those exonic regions can be captured by directing the baits to one or more of the intronic sequences nearby (for example intronic region 204 and/or 205 to capture exonic region 202 and intronic region 206 for capture of exonic region 203). In other words, a population of fragments comprising exonic regions 202 or 203 can be captured through the use of baits complementary to intronic regions 204 and/or 205 and 206. As shown in FIG. 2A, the intronic region used as an intronic bait for the nearby exonic region can be adjacent to the exonic region of interest, i.e., there is no gap between the intronic region and the targeted exonic region. In other examples, the intronic region used to capture the nearby exonic region may be near enough so that both regions are likely to be in the same fragment, but there is a gap of one or more nucleotides between the exonic region and the intronic region (for example 202 and 205 in FIG. 2A).

In some examples, rather than designing the baits to target particular regions of the genome, a tiling approach is used. In such an approach, rather than targeting specific exonic or intronic regions, the baits are designed to be complementary to portions of the genome at particular ranges or distances. For example, the library of baits can be designed to cover sequences every 5 kilobases (kb) along the genome, such that applying this library of baits to a fragmented genomic sample will capture only a certain subset of the genome, i.e., those regions that are contained in fragments containing complementary sequences to the baits. As will be appreciated, the baits can be designed based on a reference sequence, such as a human genome reference sequence. In further examples, the tiled library of baits is designed to capture regions every 1, 2, 5, 10, 15, 20, 25, 50, 100, 200, 250, 500, 750, 1000, or 10000 kilobases of a genome. In still further examples, the tiled library of baits is designed to capture a mixture of distances; that mixture can be a random mixture of distances or intelligently designed such that a specific portion or percentage of the genome is captured. As will be appreciated, such tiling methods of capture will capture both intronic and exonic regions of the genome for further analysis such as sequencing. Any of the tiling or other intronic baiting methods described herein provide a way to link sequence information from exons widely separated by long intervening intronic regions.
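By way of illustration only, the tiling approach described above can be sketched as a calculation of bait coordinates at a fixed spacing along a reference sequence. The function names, the half-open coordinate convention, and the 120-base bait length are assumptions made for this sketch and are not taken from the disclosure.

```python
def tile_baits(genome_length, spacing, bait_length):
    """Return half-open (start, end) bait coordinates placed every `spacing` bases.

    Hypothetical helper: assumes a single linear reference of `genome_length`
    bases and bait_length <= spacing, so that baits do not overlap.
    """
    if spacing <= 0 or bait_length <= 0:
        raise ValueError("spacing and bait_length must be positive")
    if bait_length > spacing:
        raise ValueError("this sketch assumes bait_length <= spacing")
    baits = []
    for start in range(0, genome_length, spacing):
        # Clip the last bait so it does not run past the end of the reference.
        end = min(start + bait_length, genome_length)
        baits.append((start, end))
    return baits


def fraction_baited(baits, genome_length):
    """Fraction of reference bases directly complementary to a bait."""
    covered = sum(end - start for start, end in baits)
    return covered / genome_length


# Example: baits every 5 kb along a 20 kb reference, 120 bases per bait.
example_baits = tile_baits(20_000, 5_000, 120)
example_fraction = fraction_baited(example_baits, 20_000)
```

The fraction of bases that are baited is a lower bound on what is captured, because each captured fragment also carries sequence flanking the bait.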

In further examples, the tiling or other capture methods described herein will capture about 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the whole genome. In still further examples, the capture methods described herein capture about 1-10%, 5-20%, 10-30%, 15-40%, 20-50%, 25-60%, 30-70%, 35-80%, 40-90%, or 45-95% of the whole genome.

In some examples, sample preparation methods, including methods of fragmenting, amplifying, partitioning, and otherwise processing genomic DNA, can lead to biases or lower coverage of certain regions of a genome. Such biases or lowered coverage can be compensated for in the methods and systems disclosed herein by altering the concentration or genomic locations of baits used to capture targeted regions of the genome. In some examples, it may be known that certain regions of the genome containing high GC content or other structural variations will lead to low coverage. In such situations, the library of baits can be altered to increase the concentration of baits directed to those regions of low coverage; in other words, the population of baits used may be “spiked” to ensure that a sufficient number of fragments containing targeted regions of the genome in those low coverage areas are obtained in the final population of fragments to be sequenced. Such spiking of baits may be conducted in commercially available whole exome kits, such that a custom library of baits directed toward the lower coverage regions is added to off-the-shelf exome capture kits. Additionally, baits can be designed to target a region of the genome that is very close to the region of interest, but has more favorable coverage, as is also discussed in further detail herein and embodiments of which are schematically illustrated in FIG. 2.
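One possible way to reason about such spiking is sketched here, under the assumption that bait concentration is raised in proportion to the shortfall of a region's observed coverage relative to the median. The disclosure does not specify any particular rule; the function name, the median target, and the cap of 10x are hypothetical choices made only for illustration.

```python
from statistics import median


def spike_factors(coverage_by_region, max_factor=10.0):
    """Return a relative bait-concentration multiplier for each region.

    Hypothetical rule: regions at or above the median coverage keep a
    multiplier of 1.0; regions below it are boosted by median/coverage,
    capped at `max_factor`. Regions with zero coverage receive the cap.
    """
    target = median(coverage_by_region.values())
    factors = {}
    for region, coverage in coverage_by_region.items():
        if coverage <= 0:
            factors[region] = max_factor
        else:
            factors[region] = min(max_factor, max(1.0, target / coverage))
    return factors


# Example: exon_B (e.g., a high-GC region) shows one third of the median coverage.
observed = {"exon_A": 30.0, "exon_B": 10.0, "exon_C": 60.0}
example_factors = spike_factors(observed)
```

In practice the relationship between bait concentration and recovered coverage would need to be determined empirically, so the multipliers above are a starting point rather than a prediction.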

In further examples, the library of baits used in methods of the present invention is a product of informed design that fulfills one or more characteristics as further described herein. This informed design includes instances in which the library of baits is directed to informative single nucleotide polymorphisms (SNPs). The term “informative SNPs” as used herein refers to SNPs that are heterozygous. The library of baits in some examples is designed to contain a plurality of probes that are directed to regions of the genomic sample that contain informative SNPs. By “directed to” as used herein is meant that the probes contain sequences that are complementary to sequences that encompass the SNPs. In further examples, the library of baits is designed to contain probes directed to SNPs that are at predetermined distances from the boundary of an exon and an intron. In situations in which the targeted regions of the genome include regions that are devoid of or contain very few SNPs, the library of baits includes probes that tile across such regions at a predetermined distance and/or that hybridize to the first informative SNP within the next nearest intron or exon.

An advantage of the methods and systems described herein is that the targeted regions that are captured are processed prior to capture in such a way that even after the steps of capturing the targeted regions and conducting sequencing analyses, the original molecular context of those targeted regions is retained. As is discussed in further detail herein, the ability to attribute specific targeted regions to their original molecular context (which can include the original chromosome or chromosomal region from which they are derived and/or the location of particular targeted regions in relation to each other within the full genome) provides a way to obtain sequence information from regions of the genome that are otherwise poorly mapped or have poor coverage using traditional sequencing techniques.

For example, some genes possess long introns that are too long to span using generally available sequencing techniques, particularly using short-read technologies that possess superior accuracy as compared to long-read technologies. In the methods and systems described herein, however, the molecular context of targeted regions is retained, generally through the tagging procedure illustrated in FIG. 1 and described in further detail herein. As such, links can be made across extended regions of the genome. For example, as schematically illustrated in FIG. 2B, nucleic acid molecule 207 contains two exons (shaded bars) interrupted by a long intronic region (208). Generally used sequencing technologies would be unable to span the distance across the intron to provide information on the relationship of the two exons. In the methods described herein, the individual nucleic acid molecule 207 is distributed into its own discrete partition 209 and then fragmented such that different fragments contain different portions of the exons and the intron. Because each of those fragments is tagged such that any sequence information obtained from the fragments is then attributable to the discrete partition in which it was generated, each fragment is thus also attributable to the individual nucleic acid molecule 207 from which it was derived. In general, and as is described in further detail herein, after fragmentation and tagging, fragments from different partitions are combined together. Targeted capture methods can then be used to enrich the population of fragments that undergoes further analysis, such as sequencing, with fragments containing the targeted region of interest. In the example illustrated in FIG. 2B, the baits used will enrich the population of fragments to capture only those containing a portion of one of the two exons and/or part of the intervening intron, but regions outside of the exons and intron (such as 209 and 210) would not be captured. Thus, the final population of fragments that undergoes sequencing will be enriched for the fragments containing portions of the two exons of interest. Short read, high accuracy sequencing technologies can then be used to identify the sequences of this enriched population of fragments, and because each of the fragments is tagged and thus attributable to its original molecular context, i.e., its original individual nucleic acid molecule, the short read sequences can provide information that spans over the long length of the intervening intron to provide information on the relationship between the two exons.
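The attribution step described above can be illustrated with a minimal sketch in which each sequenced read has already been reduced to a (barcode, target) pair, i.e., the partition barcode carried by the read and the name of the targeted region to which it maps. The data layout, function names, and example barcodes are assumptions for illustration; read alignment and barcode error correction are outside the scope of this sketch.

```python
from collections import defaultdict


def targets_by_barcode(reads):
    """Group the targeted regions observed in reads by partition barcode.

    `reads` is an iterable of (barcode, target_name) pairs (hypothetical
    pre-processed input). Returns {barcode: set of target names}.
    """
    grouped = defaultdict(set)
    for barcode, target in reads:
        grouped[barcode].add(target)
    return dict(grouped)


def barcodes_linking(grouped, target_a, target_b):
    """Barcodes whose reads include both targets, suggesting that the two
    regions were present in the same partition and, where each partition
    holds non-overlapping molecules, on the same originating molecule."""
    return sorted(
        barcode
        for barcode, targets in grouped.items()
        if target_a in targets and target_b in targets
    )


# Example: short reads from two exons separated by a long intron.
example_reads = [
    ("ACGTAC", "exon_1"),
    ("ACGTAC", "exon_2"),
    ("TTGCAA", "exon_1"),
    ("GGATCC", "exon_2"),
    ("ACGTAC", "exon_1"),
]
example_grouped = targets_by_barcode(example_reads)
example_links = barcodes_linking(example_grouped, "exon_1", "exon_2")
```

Here only the barcode shared by reads from both exons supports a link across the intron; the other two barcodes each report a single exon and carry no linkage information on their own.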

As noted above, the methods and systems described herein provide individual molecular context for short sequence reads of longer nucleic acids. As used herein, individual molecular context refers to sequence context beyond the specific sequence read, e.g., relation to adjacent or proximal sequences that are not included within the sequence read itself and that, as such, would typically not be included in whole or in part in a short sequence read, e.g., a read of about 150 bases, or about 300 bases for paired reads. In particularly preferred aspects, the methods and systems provide long range sequence context for short sequence reads. Such long range context includes relationship or linkage of a given sequence read to sequence reads that are within a distance of each other of longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb, or longer. As will be appreciated, by providing long range individual molecular context, one can also derive the phasing information of variants within that individual molecular context, e.g., variants on a particular long molecule will be, by definition, commonly phased.

By providing longer range individual molecular context, the methods and systems of the invention also provide much longer inferred molecular context (also referred to herein as a “long virtual single molecule read”). Sequence context, as described herein, can include mapping or providing linkage of fragments across different (generally on the kilobase scale) ranges of full genomic sequence. These methods include mapping the short sequence reads to the individual longer molecules or contigs of linked molecules, as well as long range sequencing of large portions of the longer individual molecules, e.g., having contiguous determined sequences of individual molecules where such determined sequences are longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb. As with sequence context, the attribution of short sequences to longer nucleic acids, e.g., both individual long nucleic acid molecules or collections of linked nucleic acid molecules or contigs, may include both mapping of short sequences against longer nucleic acid stretches to provide high level sequence context, as well as providing assembled sequences from the short sequences through these longer nucleic acids.

Furthermore, while one may utilize the long range sequence context associated with long individual molecules, having such long range sequence context also allows one to infer even longer range sequence context. By way of one example, by providing the long range molecular context described above, one can identify overlapping variant portions, e.g., phased variants, translocated sequences, etc., among long sequences from different originating molecules, allowing the inferred linkage between those molecules. Such inferred linkages or molecular contexts are referred to herein as “inferred contigs”. In some cases when discussed in the context of phased sequences, the inferred contigs may represent commonly phased sequences, e.g., where by virtue of overlapping phased variants, one can infer a phased contig of substantially greater length than the individual originating molecules. These phased contigs are referred to herein as “phase blocks”.

By starting with longer single molecule reads (e.g., the “long virtual single molecule reads” discussed above), one can derive longer inferred contigs or phase blocks than would otherwise be attainable using short read sequencing technologies or other approaches to phased sequencing. See, e.g., published U.S. Patent Application No. 2013-0157870. In particular, using the methods and systems described herein, one can obtain inferred contig or phase block lengths having an N50 (where the sum of the block lengths that are greater than the stated N50 number is 50% of the sum of all block lengths) of at least about 10 kb, at least about 20 kb, or at least about 50 kb. In more preferred aspects, inferred contig or phase block lengths having an N50 of at least about 100 kb, at least about 150 kb, at least about 200 kb, and in many cases, at least about 250 kb, at least about 300 kb, at least about 350 kb, at least about 400 kb, and in some cases, at least about 500 kb or more, are attained. In still other cases, maximum phase block lengths in excess of 200 kb, in excess of 300 kb, in excess of 400 kb, in excess of 500 kb, in excess of 1 Mb, or even in excess of 2 Mb may be obtained.
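For clarity, the N50 statistic referred to above can be computed as in the following sketch. It uses the common convention that the N50 is the length of the shortest block in the smallest set of longest blocks that together account for at least half of the total length; the function name is chosen for illustration only.

```python
def n50(block_lengths):
    """Return the N50 of a collection of contig or phase block lengths.

    Blocks are visited from longest to shortest; the N50 is the length of
    the block at which the running total first reaches half of the sum of
    all block lengths.
    """
    lengths = sorted(block_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one block length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating point rounding at the halfway point.
        if 2 * running >= total:
            return length


# Example: four phase blocks of 100, 200, 300 and 400 kb (total 1000 kb).
# The 400 kb and 300 kb blocks together reach 700 kb, which passes 500 kb.
example_n50 = n50([100, 200, 300, 400])
```

Because the statistic is weighted by length, a small number of very long phase blocks raises the N50 far more than many short blocks lower it.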

In one aspect, and in conjunction with any of the capture methods described above and later herein, the methods and systems described herein provide for the compartmentalization, depositing or partitioning of sample nucleic acids, or fragments thereof, into discrete compartments or partitions (referred to interchangeably herein as partitions), where each partition maintains separation of its own contents from the contents of other partitions. Unique identifiers, e.g., barcodes, may be previously, subsequently or concurrently delivered to the partitions that hold the compartmentalized or partitioned sample nucleic acids, in order to allow for the later attribution of the characteristics, e.g., nucleic acid sequence information, to the sample nucleic acids included within a particular compartment, and particularly to relatively long stretches of contiguous sample nucleic acids that may be originally deposited into the partitions.

The sample nucleic acids utilized in the methods described herein typically represent a number of overlapping portions of the overall sample to be analyzed, e.g., an entire chromosome, exome, or other large genomic portion. These sample nucleic acids may include whole genomes, individual chromosomes, exomes, amplicons, or any of a variety of different nucleic acids of interest. The sample nucleic acids are typically partitioned such that the nucleic acids are present in the partitions in relatively long fragments or stretches of contiguous nucleic acid molecules. Typically, these fragments of the sample nucleic acids may be longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb, which permits the longer range molecular context described above.

The sample nucleic acids are also typically partitioned at a level whereby a given partition has a very low probability of including two overlapping fragments of the starting sample nucleic acid. This is typically accomplished by providing the sample nucleic acid at a low input amount and/or concentration during the partitioning process. As a result, in preferred cases, a given partition may include a number of long, but non-overlapping fragments of the starting sample nucleic acids. The sample nucleic acids in the different partitions are then associated with unique identifiers, where for any given partition, nucleic acids contained therein possess the same unique identifier, but where different partitions may include different unique identifiers. Moreover, because the partitioning step allocates the sample components into very small volume partitions or droplets, it will be appreciated that in order to achieve the desired allocation as set forth above, one need not conduct substantial dilution of the sample, as would be required in higher volume processes, e.g., in tubes, or wells of a multiwell plate. Further, because the systems described herein employ such high levels of barcode diversity, one can allocate diverse barcodes among higher numbers of genomic equivalents, as provided above. In particular, previously described multiwell plate approaches (see, e.g., U.S. Published Application Nos. 2013-0079231 and 2013-0157870) typically only operate with a hundred to a few hundred different barcode sequences, and employ a limiting dilution process of their sample in order to be able to attribute barcodes to different cells/nucleic acids. As such, they will generally operate with far fewer than 100 cells, which would typically provide a ratio of genomes:(barcode type) on the order of 1:10, and certainly well above 1:100. The systems described herein, on the other hand, because of the high level of barcode diversity, e.g., in excess of 10,000, 100,000, 500,000, etc. diverse barcode types, can operate at genome:(barcode type) ratios that are on the order of 1:50 or less, 1:100 or less, 1:1000 or less, or even smaller ratios, while also allowing for loading higher numbers of genomes (e.g., on the order of greater than 100 genomes per assay, greater than 500 genomes per assay, 1000 genomes per assay, or even more) while still providing for far improved barcode diversity per genome.
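The relationship between loading and the probability of overlapping fragments in a partition can be approximated with a simple model, offered here only as a rough sketch. It assumes fragments of equal length placed uniformly and independently along a single linear reference, treats each pair of fragments as independent, and ignores ploidy and end effects; none of these simplifications, nor the function name, comes from the disclosure.

```python
def prob_any_overlap(n_fragments, fragment_length, genome_length):
    """Approximate probability that at least two fragments in one partition
    overlap on the reference.

    Under the stated assumptions, two fragments of length L overlap when
    their start positions differ by less than L, which happens with
    probability of roughly 2L/G for a reference of length G.
    """
    if n_fragments < 2:
        return 0.0
    p_pair = min(1.0, 2 * fragment_length / genome_length)
    n_pairs = n_fragments * (n_fragments - 1) // 2
    return 1.0 - (1.0 - p_pair) ** n_pairs


# Example: 10 versus 1000 fragments of 50 kb per partition on a 3.2 Gb reference.
low_loading = prob_any_overlap(10, 50_000, 3_200_000_000)
high_loading = prob_any_overlap(1000, 50_000, 3_200_000_000)
```

Under this model the overlap probability grows roughly with the square of the number of fragments per partition, which is why distributing a sample across a large number of small partitions keeps overlaps rare without extensive dilution.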

Often, the sample is combined with a set of oligonucleotide tags that are releasably attached to beads prior to the partitioning step. That combination can then lead to barcoding of nucleic acids in the samples using methods known in the art and described herein. In some examples, amplification methods are used to add barcodes to the resultant amplification products, which in some examples contain smaller segments (fragments) of the full originating nucleic acid molecule from which they are derived. In some examples, methods using transposons are utilized as described in Amini et al, Nature Genetics 46: 1343-1349 (2014) (advance online publication on Oct. 29, 2014), which is herein incorporated by reference in its entirety for all purposes and in particular for all teachings related to attaching barcodes or other oligonucleotide tags to nucleic acids. In further examples, methods of attaching barcodes can include the use of nicking enzymes or polymerases and/or invasive probes such as recA to produce gaps along double stranded sample nucleic acids; barcodes can then be inserted into those gaps.

In examples in which amplification is used to tag nucleic acid fragments, the oligonucleotide tags may comprise at least a first and second region. The first region may be a barcode region that, as between oligonucleotides within a given partition, may be substantially the same barcode sequence, but as between different partitions, may be, and in most cases is, a different barcode sequence. The second region may be an N-mer (either a random N-mer or an N-mer designed to target a particular sequence) that can be used to prime the nucleic acids within the sample within the partitions. In some cases, where the N-mer is designed to target a particular sequence, it may be designed to target a particular chromosome (e.g., chromosome 1, 13, 18, or 21), or region of a chromosome, e.g., an exome or other targeted region. In some cases, the N-mer may be designed to target a particular gene or genetic region, such as a gene or region associated with a disease or disorder (e.g., cancer). Within the partitions, an amplification reaction may be conducted using the second N-mer to prime the nucleic acid sample at different places along the length of the nucleic acid. As a result of the amplification, each partition may contain amplified products of the nucleic acid that are attached to an identical or near-identical barcode, and that may represent overlapping, smaller fragments of the nucleic acids in each partition. The barcode can serve as a marker that signifies that a set of nucleic acids originated from the same partition, and thus potentially also originated from the same strand of nucleic acid. Following amplification, the nucleic acids may be pooled, sequenced, and aligned using a sequencing algorithm. Because shorter sequence reads may, by virtue of their associated barcode sequences, be aligned and attributed to a single, long fragment of the sample nucleic acid, all of the identified variants on that sequence can be attributed to a single originating fragment and single originating chromosome. Further, by aligning multiple co-located variants across multiple long fragments, one can further characterize that chromosomal contribution. Accordingly, conclusions regarding the phasing of particular genetic variants may then be drawn, as can analyses across long ranges of genomic sequence, for example, identification of sequence information across stretches of poorly characterized regions of the genome. Such information may also be useful for identifying haplotypes, which are generally a specified set of genetic variants that reside on the same nucleic acid strand or on different nucleic acid strands. Copy number variations may also be identified in this manner.

The described methods and systems provide significant advantages over current nucleic acid sequencing technologies and their associated sample preparation methods. Ensemble sample preparation and sequencing methods are predisposed towards primarily identifying and characterizing the majority constituents in the sample, and are not designed to identify and characterize minority constituents, e.g., genetic material contributed by one chromosome, or by one or a few cells, or fragmented tumor cell DNA molecules circulating in the bloodstream, that constitute a small percentage of the total DNA in the extracted sample. The described methods and systems also provide a significant advantage for detecting populations that are present within a larger sample. As such, they are particularly useful for assessing haplotype and copy number variations. The methods disclosed herein are also useful for providing sequence information over regions of the genome that are poorly characterized or are poorly represented in a population of nucleic acid targets due to biases introduced during sample preparation.

The use of the barcoding technique disclosed herein confers the unique capability of providing individual molecular context for a given set of genetic markers, i.e., attributing a given set of genetic markers (as opposed to a single marker) to individual sample nucleic acid molecules, and through variant coordinated assembly, to provide a broader or even longer range inferred individual molecular context, among multiple sample nucleic acid molecules, and/or to a specific chromosome. These genetic markers may include specific genetic loci, e.g., variants, such as SNPs, or they may include short sequences. Furthermore, the use of barcoding confers the additional advantages of facilitating the ability to discriminate between minority constituents and majority constituents of the total nucleic acid population extracted from the sample, e.g., for detection and characterization of circulating tumor DNA in the bloodstream, and also reduces or eliminates amplification bias during optional amplification steps. In addition, implementation in a microfluidics format confers the ability to work with extremely small sample volumes and low input quantities of DNA, as well as the ability to rapidly process large numbers of sample partitions (droplets) to facilitate genome-wide tagging.

As described previously, an advantage of the methods and systems described herein is that they can achieve the desired results through the use of ubiquitously available, short read sequencing technologies. Such technologies have the advantages of being readily available and widely dispersed within the research community, with protocols and reagent systems that are well characterized and highly effective. These short read sequencing technologies include those available from, e.g., Illumina, Inc. (GXII, NextSeq, MiSeq, HiSeq, X10), Ion Torrent division of Thermo-Fisher (Ion Proton and Ion PGM), pyrosequencing methods, as well as others.

Of particular advantage is that the methods and systems described herein utilize these short read sequencing technologies and do so with their associated low error rates. In particular, the methods and systems described herein achieve the desired individual molecular readlengths or context, as described above, but with individual sequencing reads, excluding mate pair extensions, that are shorter than 1000 bp, shorter than 500 bp, shorter than 300 bp, shorter than 200 bp, shorter than 150 bp or even shorter; and with sequencing error rates for such individual molecular readlengths that are less than 5%, less than 1%, less than 0.5%, less than 0.1%, less than 0.05%, less than 0.01%, less than 0.005%, or even less than 0.001%.

II. Work Flow Overview

In one exemplary aspect, the methods and systems described in the disclosure provide for depositing or partitioning individual samples (e.g., nucleic acids) into discrete partitions, where each partition maintains separation of its own contents from the contents in other partitions. As used herein, the partitions refer to containers or vessels that may include a variety of different forms, e.g., wells, tubes, micro or nanowells, through holes, or the like. In preferred aspects, however, the partitions are flowable within fluid streams. These vessels may be comprised of, e.g., microcapsules or micro-vesicles that have an outer barrier surrounding an inner fluid center or core, or they may be a porous matrix that is capable of entraining and/or retaining materials within its matrix. In preferred aspects, however, these partitions may comprise droplets of aqueous fluid within a non-aqueous continuous phase, e.g., an oil phase. A variety of different vessels are described in, for example, U.S. patent application Ser. No. 13/966,150, filed Aug. 13, 2013. Likewise, emulsion systems for creating stable droplets in non-aqueous or oil continuous phases are described in detail in, e.g., Published U.S. Patent Application No. 2010-0105112. In certain cases, microfluidic channel networks are particularly suited for generating partitions as described herein. Examples of such microfluidic devices include those described in detail in U.S. patent application Ser. No. 14/682,952, filed Apr. 9, 2015, the full disclosure of which is incorporated herein by reference in its entirety for all purposes. Alternative mechanisms may also be employed in the partitioning of individual cells, including porous membranes through which aqueous mixtures of cells are extruded into non-aqueous fluids. Such systems are generally available from, e.g., Nanomi, Inc.

In the case of droplets in an emulsion, partitioning of sample materials, e.g., nucleic acids, into discrete partitions may generally be accomplished by flowing an aqueous, sample-containing stream into a junction into which is also flowing a non-aqueous stream of partitioning fluid, e.g., a fluorinated oil, such that aqueous droplets are created within the flowing stream of partitioning fluid, where such droplets include the sample materials. As described below, the partitions, e.g., droplets, also typically include co-partitioned barcode oligonucleotides. The relative amount of sample materials within any particular partition may be adjusted by controlling a variety of different parameters of the system, including, for example, the concentration of sample in the aqueous stream, the flow rate of the aqueous stream and/or the non-aqueous stream, and the like. The partitions described herein are often characterized by having extremely small volumes. For example, in the case of droplet based partitions, the droplets may have overall volumes that are less than 1000 pL, less than 900 pL, less than 800 pL, less than 700 pL, less than 600 pL, less than 500 pL, less than 400 pL, less than 300 pL, less than 200 pL, less than 100 pL, less than 50 pL, less than 20 pL, less than 10 pL, or even less than 1 pL. Where co-partitioned with beads, it will be appreciated that the sample fluid volume within the partitions may be less than 90% of the above described volumes, less than 80%, less than 70%, less than 60%, less than 50%, less than 40%, less than 30%, less than 20%, or even less than 10% of the above described volumes. In some cases, the use of low reaction volume partitions is particularly advantageous in performing reactions with very small amounts of starting reagents, e.g., input nucleic acids. Methods and systems for analyzing samples with low input nucleic acids are presented in U.S. patent application Ser. No. 14/752,602, filed Jun. 26, 2015, the full disclosure of which is hereby incorporated by reference in its entirety.

Once the samples are introduced into their respective partitions, in accordance with the methods and systems described herein, the sample nucleic acids within partitions are generally provided with unique identifiers such that, upon characterization of those nucleic acids, they may be attributed as having been derived from their respective origins. Accordingly, the sample nucleic acids are typically co-partitioned with the unique identifiers (e.g., barcode sequences). In particularly preferred aspects, the unique identifiers are provided in the form of oligonucleotides that comprise nucleic acid barcode sequences that may be attached to those samples. The oligonucleotides are partitioned such that as between oligonucleotides in a given partition, the nucleic acid barcode sequences contained therein are the same, but as between different partitions, the oligonucleotides can, and preferably do, have differing barcode sequences. In preferred aspects, only one nucleic acid barcode sequence will be associated with a given partition, although in some cases, two or more different barcode sequences may be present.

The nucleic acid barcode sequences will typically include from 6 to about 20 or more nucleotides within the sequence of the oligonucleotides. These nucleotides may be completely contiguous, i.e., in a single stretch of adjacent nucleotides, or they may be separated into two or more separate subsequences that are separated by one or more nucleotides. Separated subsequences may typically be from about 4 to about 16 nucleotides in length.

The co-partitioned oligonucleotides also typically comprise other functional sequences useful in the processing of the partitioned nucleic acids. These sequences include, e.g., targeted or random/universal amplification primer sequences for amplifying the genomic DNA from the individual nucleic acids within the partitions while attaching the associated barcode sequences, sequencing primers, hybridization or probing sequences, e.g., for identification of presence of the sequences, or for pulling down barcoded nucleic acids, or any of a number of other potential functional sequences. Again, co-partitioning of oligonucleotides and associated barcodes and other functional sequences, along with sample materials, is described in, for example, U.S. patent application Ser. Nos. 14/316,383, 14/316,398, 14/316,416, 14/316,431, 14/316,447, 14/316,463, all filed Jun. 26, 2014, as well as U.S. patent application Ser. No. 14/175,935, filed Feb. 7, 2014, the full disclosures of which are hereby incorporated by reference in their entireties.

Briefly, in one exemplary process, beads are provided that each may include large numbers of the above described oligonucleotides releasably attached to the beads, where all of the oligonucleotides attached to a particular bead may include the same nucleic acid barcode sequence, but where a large number of diverse barcode sequences may be represented across the population of beads used. Typically, the population of beads may provide a diverse barcode sequence library that may include at least 1000 different barcode sequences, at least 10,000 different barcode sequences, at least 100,000 different barcode sequences, or in some cases, at least 1,000,000 different barcode sequences. Additionally, each bead may typically be provided with large numbers of oligonucleotide molecules attached. In particular, the number of molecules of oligonucleotides including the barcode sequence on an individual bead may be at least about 10,000 oligonucleotide molecules, at least 100,000 oligonucleotide molecules, at least 1,000,000 oligonucleotide molecules, at least 100,000,000 oligonucleotide molecules, and in some cases at least 1 billion oligonucleotide molecules.
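The relationship between the barcode lengths and the library sizes described above can be illustrated with a short calculation. The following is a minimal sketch, assuming a four-letter nucleotide alphabet and that every possible sequence of a given length is usable as a barcode (in practice some sequences may be excluded); the function names are hypothetical and are not part of the described methods.

```python
def barcode_space(length: int, alphabet_size: int = 4) -> int:
    """Number of distinct sequences of the given length (assumes all are usable)."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return alphabet_size ** length


def min_length_for(library_size: int, alphabet_size: int = 4) -> int:
    """Smallest barcode length whose sequence space holds library_size barcodes."""
    if library_size < 1:
        raise ValueError("library_size must be at least 1")
    length = 0
    while barcode_space(length, alphabet_size) < library_size:
        length += 1
    return length


if __name__ == "__main__":
    for size in (1_000, 10_000, 100_000, 1_000_000):
        n = min_length_for(size)
        print(f"library of {size:>9,} barcodes needs at least {n} nucleotides "
              f"({barcode_space(n):,} possible sequences)")
```

Under these assumptions, a 6-nucleotide barcode already allows 4,096 distinct sequences, and a library of 1,000,000 barcodes requires at least 10 nucleotides.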

The oligonucleotides may be releasable from the beads upon the application of a particular stimulus to the beads. In some cases, the stimulus may be a photo-stimulus, e.g., through cleavage of a photo-labile linkage that may release the oligonucleotides. In some cases, a thermal stimulus may be used, where elevation of the temperature of the beads' environment may result in cleavage of a linkage or other release of the oligonucleotides from the beads. In some cases, a chemical stimulus may be used that cleaves a linkage of the oligonucleotides to the beads, or otherwise may result in release of the oligonucleotides from the beads.

In accordance with the methods and systems described herein, the beads including the attached oligonucleotides may be co-partitioned with the individual samples, such that a single bead and a single sample are contained within an individual partition. In some cases, where single bead partitions are desired, it may be desirable to control the relative flow rates of the fluids such that, on average, the partitions contain less than one bead per partition, in order to ensure that those partitions that are occupied are primarily singly occupied. Likewise, one may wish to control the flow rate to provide that a higher percentage of partitions are occupied, e.g., allowing for only a small percentage of unoccupied partitions. In preferred aspects, the flows and channel architectures are controlled so as to ensure a desired number of singly occupied partitions, less than a certain level of unoccupied partitions and less than a certain level of multiply occupied partitions.
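The trade-off between unoccupied, singly occupied and multiply occupied partitions can be illustrated numerically. The following is a minimal sketch, assuming that beads are loaded into partitions independently so that the number of beads per partition follows a Poisson distribution; this statistical model is an assumption made for illustration only, and controlled flows and channel architectures as described above may produce occupancy levels that differ from it. The function names are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k beads in a partition under a Poisson model."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def occupancy(mean_beads_per_partition: float) -> dict:
    """Fractions of empty, singly and multiply occupied partitions (Poisson assumption)."""
    if mean_beads_per_partition <= 0:
        raise ValueError("mean must be positive")
    empty = poisson_pmf(0, mean_beads_per_partition)
    single = poisson_pmf(1, mean_beads_per_partition)
    multiple = 1.0 - empty - single
    return {
        "empty": empty,
        "single": single,
        "multiple": multiple,
        # Share of occupied partitions that hold exactly one bead.
        "single_given_occupied": single / (1.0 - empty),
    }


if __name__ == "__main__":
    for mean in (0.1, 0.5, 1.0):
        stats = occupancy(mean)
        print(mean, {name: round(value, 4) for name, value in stats.items()})
```

Under this model, lowering the average number of beads per partition raises the share of occupied partitions that are singly occupied, at the cost of a larger share of unoccupied partitions.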

FIG. 3 illustrates one particular example method for barcoding and subsequently sequencing a sample nucleic acid, particularly for use in a copy number variation or haplotype assay. First, a sample comprising nucleic acid may be obtained from a source, 300, and a set of barcoded beads may also be obtained, 310. The beads are preferably linked to oligonucleotides containing one or more barcode sequences, as well as a primer, such as a random N-mer or other primer. Preferably, the barcode sequences are releasable from the barcoded beads, e.g., through cleavage of a linkage between the barcode and the bead or through degradation of the underlying bead to release the barcode, or a combination of the two. For example, in certain preferred aspects, the barcoded beads can be degraded or dissolved by an agent, such as a reducing agent, to release the barcode sequences. In this example, a low quantity of the sample comprising nucleic acid, 305, barcoded beads, 315, and optionally other reagents, e.g., a reducing agent, 320, are combined and subjected to partitioning. By way of example, such partitioning may involve introducing the components to a droplet generation system, such as a microfluidic device, 325. With the aid of the microfluidic device 325, a water-in-oil emulsion 330 may be formed, wherein the emulsion contains aqueous droplets that contain sample nucleic acid, 305, reducing agent, 320, and barcoded beads, 315. The reducing agent may dissolve or degrade the barcoded beads, thereby releasing the oligonucleotides with the barcodes and random N-mers from the beads within the droplets, 335. The random N-mers may then prime different regions of the sample nucleic acid, resulting in amplified copies of the sample after amplification, wherein each copy is tagged with a barcode sequence, 340. Preferably, each droplet contains a set of oligonucleotides that contain identical barcode sequences and different random N-mer sequences. Subsequently, the emulsion is broken, 345, and additional sequences (e.g., sequences that aid in particular sequencing methods, additional barcodes, etc.) may be added, via, for example, amplification methods, 350 (e.g., PCR). Sequencing may then be performed, 355, and an algorithm applied to interpret the sequencing data, 360. Sequencing algorithms are generally capable, for example, of performing analysis of barcodes to align sequencing reads and/or identify the sample to which a particular sequence read belongs. In addition, and as is described herein, these algorithms may also further be used to attribute the sequences of the copies to their originating molecular context.

As noted above, while single bead occupancy may be the most desired state, it will be appreciated that multiply occupied partitions or unoccupied partitions may often be present. An example of a microfluidic channel structure for co-partitioning samples and beads comprising barcode oligonucleotides is schematically illustrated in FIG. 4. As shown, channel segments 402, 404, 406, 408 and 410 are provided in fluid communication at channel junction 412. An aqueous stream comprising the individual samples 414 is flowed through channel segment 402 toward channel junction 412. As described elsewhere herein, these samples may be suspended within an aqueous fluid prior to the partitioning process.

Concurrently, an aqueous stream comprising the barcode carrying beads 416 is flowed through channel segment 404 toward channel junction 412. A non-aqueous partitioning fluid is introduced into channel junction 412 from each of side channels 406 and 408, and the combined streams are flowed into outlet channel 410. Within channel junction 412, the two aqueous streams from channel segments 402 and 404 are combined, and partitioned into droplets 418, that include co-partitioned samples 414 and beads 416. As noted previously, by controlling the flow characteristics of each of the fluids combining at channel junction 412, as well as controlling the geometry of the channel junction, one can optimize the combination and partitioning to achieve a desired occupancy level of beads, samples or both, within the partitions 418 that are generated.

As will be appreciated, a number of other reagents may be co-partitioned along with the samples and beads, including, for example, chemical stimuli, nucleic acid extension, transcription, and/or amplification reagents such as polymerases, reverse transcriptases, nucleoside triphosphates or NTP analogues, primer sequences and additional cofactors such as divalent metal ions used in such reactions, ligation reaction reagents, such as ligase enzymes and ligation sequences, dyes, labels, or other tagging reagents.

Once co-partitioned, the oligonucleotides disposed upon the bead may be used to barcode and amplify the partitioned samples. A particularly elegant process for use of these barcode oligonucleotides in amplifying and barcoding samples is described in detail in U.S. patent application Ser. Nos. 14/316,383, 14/316,398, 14/316,416, 14/316,431, 14/316,447, 14/316,463, all filed Jun. 26, 2014, the full disclosures of which are hereby incorporated by reference in their entireties. Briefly, in one aspect, the oligonucleotides present on the beads are co-partitioned with the samples and released from their beads into the partition with the samples. The oligonucleotides typically include, along with the barcode sequence, a primer sequence at the 5′ end. This primer sequence may be random or structured. Random primer sequences are generally intended to randomly prime numerous different regions of the samples. Structured primer sequences can include a range of different structures including defined sequences targeted to prime upstream of a specific targeted region of the sample as well as primers that have some sort of partially defined structure, including without limitation primers containing a percentage of specific bases (such as a percentage of GC N-mers), primers containing partially or wholly degenerate sequences, and/or primers containing sequences that are partially random and partially structured in accordance with any of the description herein. As will be appreciated, any one or more of the above types of random and structured primers may be included in oligonucleotides in any combination.

Once released, the primer portion of the oligonucleotide can anneal to a complementary region of the sample. Extension reaction reagents, e.g., DNA polymerase, nucleoside triphosphates, co-factors (e.g., Mg2+ or Mn2+ etc.), that are also co-partitioned with the samples and beads, then extend the primer sequence using the sample as a template, to produce a complementary fragment to the strand of the template to which the primer annealed, which complementary fragment includes the oligonucleotide and its associated barcode sequence. Annealing and extension of multiple primers to different portions of the sample may result in a large pool of overlapping complementary fragments of the sample, each possessing its own barcode sequence indicative of the partition in which it was created. In some cases, these complementary fragments may themselves be used as a template primed by the oligonucleotides present in the partition to produce a complement of the complement that, again, includes the barcode sequence. In some cases, this replication process is configured such that when the first complement is duplicated, it produces two complementary sequences at or near its termini, to allow the formation of a hairpin structure or partial hairpin structure, which reduces the ability of the molecule to be the basis for producing further iterative copies. A schematic illustration of one example of this is shown in FIG. 5.

As the figure shows, oligonucleotides that include a barcode sequence are co-partitioned in, e.g., a droplet 502 in an emulsion, along with a sample nucleic acid 504. As noted elsewhere herein, the oligonucleotides 508 may be provided on a bead 506 that is co-partitioned with the sample nucleic acid 504, which oligonucleotides are preferably releasable from the bead 506, as shown in panel A. The oligonucleotides 508 include a barcode sequence 512, in addition to one or more functional sequences, e.g., sequences 510, 514 and 516. For example, oligonucleotide 508 is shown as comprising barcode sequence 512, as well as sequence 510 that may function as an attachment or immobilization sequence for a given sequencing system, e.g., a P5 sequence used for attachment in flow cells of an Illumina HiSeq or MiSeq system. As shown, the oligonucleotides also include a primer sequence 516, which may include a random or targeted N-mer for priming replication of portions of the sample nucleic acid 504. Also included within oligonucleotide 508 is a sequence 514 which may provide a sequencing priming region, such as a “read1” or R1 priming region, that is used to prime polymerase mediated, template directed sequencing by synthesis reactions in sequencing systems. In many cases, the barcode sequence 512, immobilization sequence 510 and R1 sequence 514 may be common to all of the oligonucleotides attached to a given bead. The primer sequence 516 may vary for random N-mer primers, or may be common to the oligonucleotides on a given bead for certain targeted applications.

Based upon the presence of primer sequence 516, the oligonucleotides are able to prime the sample nucleic acid as shown in panel B, which allows for extension of the oligonucleotides 508 and 508a using polymerase enzymes and other extension reagents also co-partitioned with the bead 506 and sample nucleic acid 504. As shown in panel C, following extension of the oligonucleotides that, for random N-mer primers, would anneal to multiple different regions of the sample nucleic acid 504, multiple overlapping complements or fragments of the nucleic acid are created, e.g., fragments 518 and 520. Although including sequence portions that are complementary to portions of sample nucleic acid, e.g., sequences 522 and 524, these constructs are generally referred to herein as comprising fragments of the sample nucleic acid 504, having the attached barcode sequences. As will be appreciated, the replicated portions of the template sequences as described above are often referred to herein as “fragments” of that template sequence. Notwithstanding the foregoing, however, the term “fragment” encompasses any representation of a portion of the originating nucleic acid sequence, e.g., a template or sample nucleic acid, including those created by other mechanisms of providing portions of the template sequence, such as actual fragmentation of a given molecule of sequence, e.g., through enzymatic, chemical or mechanical fragmentation. In preferred aspects, however, fragments of a template or sample nucleic acid sequence will denote replicated portions of the underlying sequence or complements thereof.

The barcoded nucleic acid fragments may then be subjected to characterization, e.g., through sequence analysis, or they may be further amplified in the process, as shown in panel D. For example, additional oligonucleotides, e.g., oligonucleotide 508b, also released from bead 506, may prime the fragments 518 and 520. In particular, again, based upon the presence of the random N-mer primer 516b in oligonucleotide 508b (which in many cases will be different from other random N-mers in a given partition, e.g., primer sequence 516), the oligonucleotide anneals with the fragment 518, and is extended to create a complement 526 to at least a portion of fragment 518 which includes sequence 528, that comprises a duplicate of a portion of the sample nucleic acid sequence. Extension of the oligonucleotide 508b continues until it has replicated through the oligonucleotide portion 508 of fragment 518. As noted elsewhere herein, and as illustrated in panel D, the oligonucleotides may be configured to prompt a stop in the replication by the polymerase at a desired point, e.g., after replicating through sequences 516 and 514 of oligonucleotide 508 that is included within fragment 518. As described herein, this may be accomplished by different methods, including, for example, the incorporation of different nucleotides and/or nucleotide analogues that are not capable of being processed by the polymerase enzyme used. For example, this may include the inclusion of uracil containing nucleotides within the sequence region 512 to cause a non-uracil tolerant polymerase to cease replication of that region. As a result, a fragment 526 is created that includes the full-length oligonucleotide 508b at one end, including the barcode sequence 512, the attachment sequence 510, the R1 primer region 514, and the random N-mer sequence 516b. At the other end of the sequence will be included the complement 516′ to the random N-mer of the first oligonucleotide 508, as well as a complement to all or a portion of the R1 sequence, shown as sequence 514′. The R1 sequence 514 and its complement 514′ are then able to hybridize together to form a partial hairpin structure 528. As will be appreciated, because the random N-mers differ among different oligonucleotides, these sequences and their complements would not be expected to participate in hairpin formation, e.g., sequence 516′, which is the complement to random N-mer 516, would not be expected to be complementary to random N-mer sequence 516b. This would not be the case for other applications, e.g., targeted primers, where the N-mers would be common among oligonucleotides within a given partition.
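The pairing logic behind the partial hairpin can be illustrated with reverse complements. The following is a minimal sketch that treats hybridization as exact reverse-complement matching, which is a simplification; the sequences shown are hypothetical placeholders and are not the actual R1 or N-mer sequences, and the function names are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def termini_can_pair(five_prime_segment: str, three_prime_segment: str) -> bool:
    """True if the 3' segment is the exact reverse complement of the 5' segment."""
    return three_prime_segment.upper() == reverse_complement(five_prime_segment)


if __name__ == "__main__":
    r1 = "ACGGTCATTGCA"        # hypothetical stand-in for R1 sequence 514
    nmer_first = "TTGACA"      # hypothetical stand-in for random N-mer 516
    nmer_second = "ACGTTG"     # hypothetical stand-in for random N-mer 516b

    # 514 at one end and its complement 514' at the other end can pair.
    print(termini_can_pair(r1, reverse_complement(r1)))
    # 516b and 516' come from different random N-mers and are not expected to pair.
    print(termini_can_pair(nmer_second, reverse_complement(nmer_first)))
```

With targeted primers, where the N-mer is common within a partition, the second check would also succeed, consistent with the distinction drawn above.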

Forming these partial hairpin structures allows for the removal of first level duplicates of the sample sequence from further replication, e.g., preventing iterative copying of copies. The partial hairpin structure also provides a useful structure for subsequent processing of the created fragments, e.g., fragment 526.

All of the fragments from multiple different partitions may then be pooled for sequencing on high throughput sequencers as described herein. Because each fragment is coded as to its partition of origin, the sequence of that fragment may be attributed back to its origin based upon the presence of the barcode. This is schematically illustrated in FIG. 6. As shown in one example, a nucleic acid 604 originating from a first source 600 (e.g., individual chromosome, strand of nucleic acid, etc.) and a nucleic acid 606 derived from a different chromosome 602 or strand of nucleic acid are each partitioned along with their own sets of barcode oligonucleotides as described above.

Within each partition, each nucleic acid 604 and 606 is then processed to separately provide an overlapping set of second fragments of the first fragment(s), e.g., second fragment sets 608 and 610. This processing also provides the second fragments with a barcode sequence that is the same for each of the second fragments derived from a particular first fragment. As shown, the barcode sequence for second fragment set 608 is denoted by “1” while the barcode sequence for fragment set 610 is denoted by “2”. A diverse library of barcodes may be used to differentially barcode large numbers of different fragment sets. However, it is not necessary for every second fragment set from a different first fragment to be barcoded with different barcode sequences. In fact, in many cases, multiple different first fragments may be processed concurrently to include the same barcode sequence. Diverse barcode libraries are described in detail elsewhere herein.

The barcoded fragments, e.g., from fragment sets 608 and 610, may then be pooled for sequencing using, for example, sequence by synthesis technologies available from Illumina or the Ion Torrent division of Thermo Fisher, Inc. Once sequenced, the sequence reads 612 can be attributed to their respective fragment set, e.g., as shown in aggregated reads 614 and 616, at least in part based upon the included barcodes, and optionally, and preferably, in part based upon the sequence of the fragment itself. The attributed sequence reads for each fragment set are then assembled to provide the assembled sequence for each sample fragment, e.g., sequences 618 and 620, which in turn, may be further attributed back to their respective original chromosomes (600 and 602). Methods and systems for assembling genomic sequences are described in, for example, U.S. patent application Ser. No. 14/752,773, filed Jun. 26, 2015, the full disclosure of which is hereby incorporated by reference in its entirety.
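The attribution of pooled sequence reads to fragment sets based on barcodes can be sketched as a simple grouping step. The following is a minimal illustration only, assuming that each read begins with a fixed-length barcode and that barcodes are matched exactly with no error correction; the read layout, the barcode length and the function name are hypothetical, and the subsequent use of the fragment sequence itself and the assembly step are not shown.

```python
from collections import defaultdict


def group_reads_by_barcode(reads, barcode_length):
    """Group read inserts by their leading barcode (exact match, assumed layout)."""
    groups = defaultdict(list)
    for read in reads:
        if len(read) <= barcode_length:
            continue  # too short to hold a barcode plus any insert sequence
        barcode = read[:barcode_length]
        insert = read[barcode_length:]
        groups[barcode].append(insert)
    return dict(groups)


if __name__ == "__main__":
    # Toy pooled reads: a 4-base barcode followed by fragment sequence.
    pooled_reads = [
        "AAAA" + "GATTACA",
        "CCCC" + "TTGGCCA",
        "AAAA" + "TTACAGG",
        "CCCC" + "GCCAATT",
        "AAA",  # discarded: shorter than the barcode
    ]
    for barcode, inserts in group_reads_by_barcode(pooled_reads, 4).items():
        print(barcode, inserts)
```

Each resulting group corresponds to an aggregated read set such as 614 or 616 and would then be passed to assembly.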

III. Application of Methods and Systems to Targeted Sequencing

In one aspect, the systems and methods described herein are used to obtain sequence information from targeted regions of a genome.

By “targeted” regions of a genome (as well as any grammatical equivalents thereof) is meant a whole genome or any one or more regions of a genome identified as of interest and/or selected through one or more methods described herein. The targeted regions of the genome sequenced by methods and systems described herein include without limitation introns, exons, intergenic regions, or any combination thereof. In certain examples, the methods and systems described herein provide sequence information on whole exomes, portions of exomes, one or more selected genes (including selected panels of genes), one or more introns, and combinations of intronic and exonic sequences.

Targeted regions of the genome may also include certain portions or percentages of the genome rather than regions identified by sequence. In certain embodiments, targeted regions of the genome captured and analyzed in accordance with the methods described herein include portions of the genome located every 1, 2, 5, 10, 15, 20, 25, 50, 100, 200, 250, 500, 750, 1000, or 10000 kilobases of a genome. In further embodiments, targeted regions of the genome comprise 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the whole genome. In still further embodiments, the targeted regions comprise 1-10%, 5-20%, 10-30%, 15-40%, 20-50%, 25-60%, 30-70%, 35-80%, 40-90%, or 45-95% of the whole genome.

In general, targeted regions of a genome are captured for use in any sequencing methods known in the art and described herein. By “captured” as used herein is meant any method or system for enriching a population of nucleic acid and/or nucleic acid fragments such that the resultant population contains an increased percentage of the targeted regions of interest as compared to the genomic regions that are not of interest. In further embodiments, the enriched population contains at least 50%, 55%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 97%, 98%, 99%, or 100% nucleic acids/nucleic acid fragments comprising the targeted regions.
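The enrichment percentages above can be read as the fraction of fragments in a population that comprise a targeted region. The following is a minimal sketch of that calculation, assuming that fragments and targeted regions are given as half-open coordinate intervals on a single reference sequence and that a fragment counts as comprising a targeted region if it overlaps one by at least one base; these conventions and the function names are assumptions made for illustration.

```python
def overlaps(fragment, target):
    """True if two half-open intervals (start, end) share at least one base."""
    return fragment[0] < target[1] and target[0] < fragment[1]


def on_target_fraction(fragments, targets):
    """Fraction of fragments overlapping any targeted region."""
    if not fragments:
        raise ValueError("no fragments supplied")
    hits = sum(1 for f in fragments if any(overlaps(f, t) for t in targets))
    return hits / len(fragments)


if __name__ == "__main__":
    targets = [(100, 200)]
    before_capture = [(50, 150), (300, 400), (190, 260), (200, 250)]
    after_capture = [(50, 150), (190, 260), (120, 180)]
    print(on_target_fraction(before_capture, targets))
    print(on_target_fraction(after_capture, targets))
```

Comparing the value before and after capture gives a simple measure of the enrichment achieved.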

Capture methods generally include chip-based methods, in which targeted regions are captured through hybridization or other association with capture molecules on a surface, and solution-based methods, in which oligonucleotide probes (baits), which are complementary to the targeted regions (or to regions near the targeted regions), are hybridized to genomic fragment libraries. The probes used in the capture methods disclosed herein are generally attached to capture molecules, such as biotin, which can be used to “pull down” the probes and the fragments to which they are hybridized. These pull down methods include any methods by which the baits hybridized to nucleic acids or nucleic acid fragments that contain the targeted regions of interest are separated from fragments that do not contain the regions of interest. In embodiments in which the probes are biotinylated, magnetic streptavidin beads are used to selectively pull down and enrich baits with bound targeted regions.

In further aspects, a library of baits is used that covers all the targeted regions desired for further study. In the case of whole exome analysis, such a library of baits thus includes oligonucleotide probes that together cover the full exome. In certain embodiments, only portions of the exome are needed for further analysis. In such embodiments, the baits are designed to target that subset of the exome. This design can be accomplished using methods and algorithms known in the art and in general is based upon a reference sequence, such as the human genome.

In some examples, the targeted genomic regions processed and sequenced in accordance with the methods and systems described herein are full or partial exomes. These full or partial exomes can be captured for sequencing using any methods known in the art, including without limitation any of the Roche/NimbleGen exome protocols, including the NimbleGen 2.1M Human Exome array and the NimbleGen SeqCap EZ Exome Library, any of the Agilent SureSelect products, any Illumina exome capture products, including the TruSeq and Nextera Exome products, and any other products, methods, systems and protocols known in the art.

In further embodiments, when the targeted regions of interest comprise whole or portions of the exome, the baits used to capture those targeted regions may be designed to be complementary to those exonic sequences. In other embodiments, the baits are not complementary to the exonic sequences themselves but are instead complementary to sequences near the exonic sequence or to intronic sequences between two exons. Such designs are also referred to herein as “anchored exome capture” or “intronic baiting,” by which, as discussed herein, is meant a process in which one or more portions of an exome are captured through the use of baits complementary to one or more intronic sequences near or adjacent to the one or more portions of the exome that are of interest. For example, as schematically illustrated in FIG. 2, a genomic sequence 201 comprises exonic regions 202 and 203. Those exonic regions can be captured by utilizing baits directed to one or more of the intronic sequences nearby (for example intronic region 204 and/or 205 to capture exonic region 202 and intronic region 206 for capture of exonic region 203). In other words, a population of fragments comprising exonic regions 202 or 203 would be captured through the use of baits complementary to intronic regions 204 and/or 205 and 206. In some embodiments, intronic baiting is used to bridge exons separated by long intronic regions by sparsely baiting longer introns. In such embodiments, the baits are not necessarily targeting intronic regions that are close to the exonic regions of interest, but the baits are instead designed to target regions separated by particular distances (or sets of distances) or are designed to tile across the intronic regions by a particular number of bases or combinations of numbers of bases. Such embodiments are described in further detail below.

In some embodiments, the intronic regions used for anchored exome capture/intronic baiting techniques of the invention are adjacent to the exonic region to be captured. In further embodiments, the intronic regions are separated from the exonic region to be captured by about 1-50, 2-45, 3-40, 4-35, 5-30, 6-25, 7-20, 8-15, 9-10, 2-20, 3-15, 4-10, 5-30, 10-40, 15-50, 20-75, or 25-100 nucleotides. In still further embodiments, the intronic regions are separated from the exonic regions to be captured by about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 300, 400, or 500 nucleotides. In further embodiments, particularly for situations in which sparse baiting of intronic regions is of use (such as for phase variant detection or identification of linked exonic regions across large intronic distances), the intronic regions are separated from the exonic regions to be captured by distances on the order of kilobases, e.g., 1-20, 2-18, 3-16, 4-14, 5-12, or 6-10 kilobases. Since the original molecular context of the enriched population of oligonucleotides is retained, this sparse baiting of intronic regions allows for the linking of sequence information between exonic regions separated by long introns.

In further aspects, rather than designing the baits to target particular regions of the genome, a tiling approach is used. In such an approach, rather than targeting specific exonic or intronic regions, the baits are instead designed to be complementary to portions of the genome at particular ranges or distances. For example, the library of baits can be designed to hybridize to sequences located every 5 kilobases (kb) along the genome, such that applying this library of baits to a fragmented genomic sample will capture only a certain subset of the genome, i.e., those regions that are contained in fragments containing complementary sequences to the baits. As will be appreciated, the baits can be designed based on a reference sequence, such as a human genome reference sequence. In further embodiments, the tiled library of baits is designed to capture regions every 1, 2, 5, 10, 15, 20, 25, 50, 100, 200, 250, 500, 750, 1000, or 10000 kilobases of a genome. In some examples, this tiling method has the effect of sparsely capturing intronic regions, thus providing a way to link sequence information of exonic regions that are separated by long intronic regions, because the original molecular context of those exonic regions captured through sparse capture of intronic regions is retained.
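The tiling approach can be sketched as placing one bait at a fixed interval along a reference sequence and asking which fragments contain a bait site. The following is a minimal illustration, assuming a single linear reference, half-open coordinates, a hypothetical bait length of 120 bases, and that a fragment is captured only if it contains an entire bait site; none of these values or function names are specified in the description above.

```python
def tiled_bait_starts(region_length, spacing, bait_length):
    """Start coordinates of baits placed every `spacing` bases along a region."""
    if spacing <= 0 or bait_length <= 0:
        raise ValueError("spacing and bait_length must be positive")
    return list(range(0, region_length - bait_length + 1, spacing))


def fragment_is_captured(fragment, bait_starts, bait_length):
    """True if the half-open fragment (start, end) contains a whole bait site."""
    start, end = fragment
    return any(start <= b and b + bait_length <= end for b in bait_starts)


if __name__ == "__main__":
    # Baits every 5 kb along a 1 Mb region; the 120-base bait length is assumed.
    starts = tiled_bait_starts(region_length=1_000_000, spacing=5_000, bait_length=120)
    print(len(starts), starts[:3])
    print(fragment_is_captured((4_000, 6_000), starts, 120))
    print(fragment_is_captured((1_000, 4_000), starts, 120))
```

In this sketch only fragments spanning a tiled position are retained, which is the sense in which the tiled library captures a subset of the genome.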

In still further embodiments, the baits are designed to tile the genome in a random or combined manner; for example, a mixture of tiled libraries can be used where some of the libraries capture regions every 1 kb, whereas other libraries in the mixture capture regions every 100 kb. In still further embodiments, the tiled libraries are designed so that the baits target within a range of positions within the genome; for example, the baits may target regions of every 1-10, 2-5, 5-200, 10-175, 15-150, 20-125, 30-100, 40-75, or 50-60 kb of the genome. In further examples, the tiled or other capture methods described herein will capture about 5%, 10%, 15%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, or 95% of the whole genome. As will be appreciated, such tiling methods of capture will capture both intronic and exonic regions of the genome for further analysis such as sequencing.

In yet further embodiments and in accordance with any of the methods described herein, the library of baits used in methods of the present invention is a product of informed design that fulfills one or more characteristics as further described herein. This informed design includes instances in which the library of baits is directed to informative single nucleotide polymorphisms (SNPs). As discussed above, the term “informative SNPs” as used herein refers to SNPs that are heterozygous. The library of baits in some examples is designed to contain a plurality of probes that are directed to regions of the genomic sample that contain informative SNPs. By “directed to” as used herein is meant that the probes contain sequences that are complementary to those regions of the genomic sequences. Informed bait design provides the ability to optimize targeted sequencing methods by allowing for targeted enrichment with full coverage while at the same time reducing the number of probes needed (and thus reducing costs and streamlining the work flow).

In general, for methods utilizing informed bait design, the libraries of baits are designed to include baits directed to particular sequences in targeted regions of the genome based on the presence or absence of informative SNPs in those regions and/or the location(s) of those informative SNPs. An exemplary illustration of general considerations for informed bait design is provided in FIG. 8. A region of the genome 801 can include exons (802 and 803). In some examples, an informative SNP 804 will be located at the boundary between the exon (802) and the adjacent intron. In such a situation, the bait library can be designed to include probes directed to one or more nucleotides (805) at a specified distance away from the boundary. In further examples in which there is no informative SNP at the boundary between the exon and the adjacent intron (806), the bait library can be designed to include probes directed to one or more positions in the intron near that boundary (807 and 808). Those positions will preferably include informative SNPs, but may also include other SNPs and/or other sequences as needed. In still further examples in which an exon 803 contains an informative SNP 809 in the interior of the exon but no informative SNPs at the boundaries, the bait library can be designed to include probes directed to several positions 810, 811, and 812 in the adjacent intron that include a mixture of informative and non-informative SNPs (as well as any other sequences as needed).

In some aspects, one or more input characteristics are used to design a probe bait library that is directed to shifting locations along the genome based on those input characteristics as well as map quality in various regions. This design is generally based on spacing between informative SNPs rather than on the locations of introns and exons. However, as will be appreciated, any of the descriptions provided herein related to bait design based on intron and exon locations can also be used in combination with the informed bait design methods based on informative SNPs. Input characteristics used in informed bait design include, without limitation and in any combination, locations of exons, introns, intergenic regions, and informative SNPs, as well as regions of repeating sequences (such as GC-rich regions), centromeres, and sample nucleic acid lengths.

For ease of discussion, different characteristics of informed design probe libraries are described below in terms of different potential embodiments. As will be appreciated, any of the probe libraries discussed herein, whether using any of the informed design elements or any of the other types of design discussed above, can be used singly or in any combination. The design elements utilized are selected based on the targeted genomic regions of interest as well as sample input and the quality of mapping for those regions of interest.

In some embodiments, probe bait libraries are designed to include probes directed to regions that have a high likelihood of containing informative SNPs in a given sample. Such targets may include individual bases (the informative SNPs themselves) or one or more bases that are proximal or adjacent to the informative SNPs. In still further embodiments, the targets for the probe baits may be directly adjacent to the informative SNPs or separated from an informative SNP by a distance of about 1-200, 10-190, 20-180, 30-170, 40-160, 50-150, 60-140, 70-130, 80-120, or 90-100 bases.

In further embodiments, the probe bait libraries include probes directed to regions of particular densities related to the average length of the nucleic acid molecules. For example, the library can be designed to include probes at a density of target sequences that is x-fold more dense than the average length of the nucleic acid molecules/fragments to which the probes are hybridizing, where x can be without limitation 1, 5, 10, 20, 50, 75, 100, 125, 150, or 200. Increasing the density of the probe targets relative to the length of the nucleic acids increases the ability to link probes across loci on the same physical molecule. Such methods can also improve the probability that the linked regions will include informative SNPs, thus further improving the ability of the probe bait libraries to attach to targeted regions of the genome.

The density of the probe targets may also be increased in situations in which (at the population level) there is not a high probability of informative SNPs in a given region of interest. In such regions, tiling methods such as those described herein can be used to direct probes at periodic spacings along the region. In certain embodiments, the density of the spacing can be differentially based, such that the probe spacing in these regions lacking informative SNPs is at a 1, 2, 5, 10, 25, or 50-fold shorter distance than the probe spacing in regions containing informative SNPs.

In further embodiments, the probe bait library is designed to consider only informative SNP distribution within a gene (including exons and introns). This method of design is directed to capture a sufficient number of heterozygous SNPs at key locations to link/phase from one end of the gene to the other. Such a design method includes baits directed to sets of targets that combine exonic informative SNPs with one or more non-exonic SNPs such that the distance between informative SNPs in a gene is below the above described densities of spacing.
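The gene-level design described above can be pictured as a gap-filling problem. The following is a minimal sketch only, not the method of the invention: it assumes SNP positions are integer coordinates along one gene, keeps every exonic informative SNP, and greedily adds non-exonic SNPs only where needed to keep consecutive selected SNPs within a chosen maximum spacing. The function name `select_linking_snps` and the parameter `max_gap` are hypothetical.

```python
from bisect import bisect_right


def select_linking_snps(exonic_snps, non_exonic_snps, max_gap):
    """Return sorted SNP positions to target so a gene can be linked end to end.

    All exonic informative SNPs are kept. Non-exonic SNPs are added greedily
    (farthest reachable first) wherever two consecutive exonic SNPs are more
    than `max_gap` bases apart. A gap that cannot be bridged is left as-is.
    """
    anchors = sorted(set(exonic_snps))
    fillers = sorted(set(non_exonic_snps) - set(anchors))
    selected = list(anchors[:1])
    for nxt in anchors[1:]:
        cur = selected[-1]
        while nxt - cur > max_gap:
            # Farthest filler that is within reach of `cur` and before `nxt`.
            i = bisect_right(fillers, min(cur + max_gap, nxt - 1)) - 1
            if i < 0 or fillers[i] <= cur:
                break  # no filler advances the walk; the gap stays unbridged
            cur = fillers[i]
            selected.append(cur)
        selected.append(nxt)
    return selected


if __name__ == "__main__":
    targets = select_linking_snps(
        exonic_snps=[1000, 50000],
        non_exonic_snps=[8000, 15000, 22000, 30000, 41000, 60000],
        max_gap=15000,
    )
    print(targets)
```

A greedy walk is used here because it adds few extra targets for a fixed maximum spacing; other selection strategies (for example, preferring SNPs in regions of high map quality) would fit the same description.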

Such informed design methods allow detection of not only general targeted regions of the genome, but also the detection and phasing of genomic structural variations, such as translocations and gene fusions. By ensuring that any individual gene can be phased, it follows that the vast majority of gene fusion events can be detected and phased using the methods described herein.

In certain embodiments and in accordance with any of the above, the bait libraries are designed to target probes at distances of about 1 kb to about 2 Mb. In further embodiments, the distances are from about 1-50, 5-45, 10-40, 15-35, 20-30, or 10-50 kb.

In further embodiments, the nucleic acid fragments being targeted by the probe baits are from about 2 kb to about 250 Mb. In still further embodiments, the fragments are from about 10-1000, 20-900, 30-800, 40-700, 50-600, 60-500, 70-400, 80-300, 90-200, 100-150, 50-500, or 25-300 kb.

In some embodiments, the probe bait libraries are designed such that about 60-95% of the probes hybridize to sequences containing informative SNPs. In further embodiments, the probe bait libraries are designed such that about 65-85%, 70-80%, 60-90%, 80-90%, 90-95%, or 95-99% of the probes in the library of probes are designed to hybridize to informative SNPs. In still further embodiments, at least 65%, 75%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, or 97% of the probes in the library of probes are designed to hybridize to informative SNPs. As will be appreciated, for a probe to be designed to “hybridize to” an informative SNP means that such a probe hybridizes to a sequence region that includes that informative SNP.
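As an illustration of the percentages above, the fraction of probes whose target region includes an informative SNP can be checked for a candidate library. This is a minimal sketch under stated assumptions: probes are represented as half-open `(start, end)` coordinate intervals on a single contig, and the function name `fraction_on_informative_snps` is hypothetical.

```python
from bisect import bisect_left


def fraction_on_informative_snps(probes, snp_positions):
    """Fraction of probes whose [start, end) interval contains an informative SNP."""
    snps = sorted(snp_positions)
    if not probes:
        return 0.0
    hits = 0
    for start, end in probes:
        i = bisect_left(snps, start)  # first SNP at or after the probe start
        if i < len(snps) and snps[i] < end:
            hits += 1
    return hits / len(probes)


if __name__ == "__main__":
    probes = [(100, 220), (500, 620), (900, 1020), (2000, 2120)]
    snps = [150, 610, 3000]
    frac = fraction_on_informative_snps(probes, snps)
    print(f"{frac:.0%} of probes cover an informative SNP")
```

A library designer could compare the returned value against whichever range is chosen, for example 0.60 to 0.95.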

In further embodiments, the probe bait libraries are designed to include a plurality of probes directed to informative SNPs that are located within both exons and introns in targeted portions of the genomic sample.

In still further embodiments, the libraries are designed such that a majority of the probes in the library hybridize to informative SNPs spaced apart by about 1-15, 5-10, or 3-6 kb. In yet further embodiments, the majority of the probes in the library of probes are further designed to hybridize to informative SNPs spaced apart by about 1, 3, 5, 10, 20, 30, or 50 kb.

In further embodiments, a plurality of probes within the library of probes are designed such that for targeted portions of the genomic samples in which there are no informative SNPs within 5-300, 10-50, 20-100, 30-150, or 40-200 kb of boundaries between exons and introns, the plurality of probes is designed to hybridize at an informative SNP within an intron from those boundaries.

In further embodiments, a plurality of probes within the library of probes are designed such that for targeted portions of the genomic samples in which there is a first informative SNP within an exon, located 5-300, 10-50, 20-100, 30-150, or 40-200 kb from a boundary with an adjacent intron, and a second informative SNP within the adjacent intron, located 10-50 kb from the boundary, the plurality of probes is designed to hybridize to a region of the genomic sample between the first and second informative SNPs.

In further embodiments, a plurality of probes within the library of probes are designed such that for targeted portions of the genomic samples comprising no informative SNPs for at least 5-300, 10-50, 20-100, 30-150, or 40-200 kb, the plurality of probes is designed to hybridize every 0.5, 1, 3, or 5 kb to those targeted portions of the genomic samples. In further embodiments, the plurality of probes is designed to hybridize every 0.1, 0.5, 1, 1.5, 3, 5, 10, 15, 20, 30, 35, 40, 45, or 50 kb along those targeted portions of the genomic samples.
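The periodic tiling of SNP-poor stretches described above can be sketched as follows. This is an illustrative sketch only: it assumes integer base coordinates on one contig, and the names `tile_snp_free_gaps`, `min_gap`, and `step` are hypothetical. Any stretch between consecutive informative SNPs (or between a SNP and a region edge) that is at least `min_gap` bases long receives a probe target every `step` bases.

```python
def tile_snp_free_gaps(region_start, region_end, snp_positions, min_gap, step):
    """Return probe target positions tiled inside stretches lacking informative SNPs.

    A stretch qualifies when it spans at least `min_gap` bases with no
    informative SNP; targets are then placed every `step` bases within it.
    """
    inside = sorted(p for p in snp_positions if region_start < p < region_end)
    bounds = [region_start] + inside + [region_end]
    targets = []
    for left, right in zip(bounds, bounds[1:]):
        if right - left >= min_gap:
            targets.extend(range(left + step, right, step))
    return targets


if __name__ == "__main__":
    # 100 kb region, SNP-free stretches of 20 kb or more are tiled every 5 kb.
    tiled = tile_snp_free_gaps(0, 100_000, [10_000, 15_000, 60_000],
                               min_gap=20_000, step=5_000)
    print(len(tiled), tiled[0], tiled[-1])
```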

In further embodiments, a plurality of probes within the library of probes are designed such that for targeted portions of the genomic samples in which there are no informative SNPs within 5-300, 10-50, 20-100, 30-150, or 40-200 kb of boundaries between exons and introns, the plurality of probes are designed to hybridize to the next closest informative SNP to the exon-intron boundaries.
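Locating the informative SNP closest to an exon-intron boundary, and checking whether it falls inside a chosen window, might be sketched as below. The sketch assumes integer coordinates on one contig; the function names and the `window` parameter are hypothetical and are not taken from the methods described herein.

```python
from bisect import bisect_left


def nearest_informative_snp(boundary, snp_positions):
    """Return the informative SNP position closest to `boundary`, or None."""
    snps = sorted(snp_positions)
    if not snps:
        return None
    i = bisect_left(snps, boundary)
    # Only the neighbours on either side of the boundary can be the closest.
    candidates = snps[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda p: abs(p - boundary))


def boundary_targets(boundaries, snp_positions, window):
    """Map each boundary to (nearest SNP, True if that SNP lies within `window`)."""
    result = {}
    for b in boundaries:
        snp = nearest_informative_snp(b, snp_positions)
        within = snp is not None and abs(snp - b) <= window
        result[b] = (snp, within)
    return result


if __name__ == "__main__":
    print(boundary_targets([50_000, 500], [1_000, 40_000, 90_000], window=5_000))
```

Boundaries whose flag is False are the cases addressed above, where no informative SNP lies within the window and the probe is directed to the next closest one instead.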

In further embodiments, the library of probes comprises probes designed to hybridize to regions of the genomic sample that flank exons at a density that provides linkage information across barcodes.

In still further embodiments, the range of coverage represented by the library of probes is inversely proportional to the distribution of lengths of the individual nucleic acid fragment molecules of the genomic sample in the discrete partitions, such that methods containing a higher proportion of longer individual nucleic acid fragment molecules use libraries of probes with smaller ranges of coverage.

In still further embodiments, the library of probes is optimized for coverage of the targeted portions of the genomic sample. In yet further embodiments, the density of coverage may be lower for regions of high map quality, particularly for those regions containing informative SNPs, and the density may further be higher for regions of low map quality to ensure that linkage information is provided across targeted regions.

In yet further embodiments, the library of probes has features informed by characteristics of the one or more targeted portions of a genomic sample, such that for targeted portions with high map quality, the library of probes comprises probes that hybridize to informative SNPs within 1 kb-1 Mb of boundaries of exons and introns. The library of probes may in such situations further include probes that hybridize to informative SNPs within 10-500, 20-450, 30-400, 40-350, 50-300, 60-250, 70-200, 80-150, or 90-100 kb of boundaries of exons and introns.

In yet further embodiments, the library of probes has features informed by characteristics of the one or more targeted portions of a genomic sample, such that for targeted portions in which the distribution of lengths of the barcoded fragments has a high proportion of fragments longer than about 100, 150, 200, or 250 kb, the library of probes comprises probes that hybridize to informative SNPs separated by at least 50 kb. The library of probes may in such situations further include probes that hybridize to informative SNPs separated by at least 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, or 200 kb.

In yet further embodiments, the library of probes has features informed by characteristics of the one or more targeted portions of a genomic sample, such that for targeted portions with low map quality, the library of probes comprises probes that hybridize to informative SNPs within 1 kb of exon-intron boundaries. The library of probes may in such situations further include probes that hybridize to informative SNPs within 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, 100, 125, 150, 175, or 200 kb of exon-intron boundaries. In such situations, the library will further include probes that hybridize to informative SNPs within exons, within introns, or both.

In yet further embodiments, the library of probes has features informed by characteristics of the one or more targeted portions of a genomic sample, such that for targeted portions comprising intergenic regions, the library of probes comprises probes that hybridize to informative SNPs spaced apart at distances of at least 1, 2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 75, or 100 kb.

The baits used in the capture methods described herein can be of any size or structure that is useful for enriching a population of fragments for fragments containing targeted regions of the genome. As discussed above, generally the baits of use in the present invention comprise oligonucleotide probes that are attached to a capture molecule, such as biotin. The oligonucleotide probes may be complementary to sequences within a targeted region of interest, or they may be complementary to regions outside of the targeted region but close enough to that targeted region that both the “anchoring” region and the targeted region are within the same fragment, such that the bait is able to pull down the targeted region by hybridizing to that nearby region (such as a flanking intron).

The capture molecule attached to the bait may be any capture molecule that can be used for isolating the bait and its hybridization partner from other fragments in a population. In general, the baits used herein are attached to biotin, and then solid supports comprising streptavidin (including without limitation magnetic streptavidin beads) can be used to capture the baits and the fragments to which they are hybridized. Other capture molecule pairs may include without limitation biotin/neutravidin, antigen/antibody, or complementary oligonucleotide sequences.

In further embodiments, the oligonucleotide probe portion of the baits can be of any length suitable for hybridizing to targeted regions or to regions near targeted regions. In some embodiments, the oligonucleotide probe portion of the baits used in accordance with the methods described herein (i.e., the portion that hybridizes to the targeted region of the genome or to a region near the targeted region) generally has a length of about 10 to about 150 nucleotides (e.g., 35 nucleotides, 50 nucleotides, 100 nucleotides) and is chosen to specifically hybridize to a target sequence of interest. In further embodiments, the oligonucleotide probe portion has a length of about 5-10, 10-50, 20-100, 30-90, 40-80, or 50-70 nucleotides. As will be appreciated, any of the oligonucleotide probe portions described herein may comprise RNA, DNA, non-natural nucleotides such as PNAs, LNAs, and so on, or any combinations thereof.

An advantage of the methods and systems described herein is that the targeted regions that are captured are processed prior to capture in such a way that even after the steps of capturing the targeted regions and conducting sequencing analyses, the original molecular context of those targeted regions is retained. The ability to attribute specific targeted regions to their original molecular context (which can include the original chromosome or chromosomal region from which they are derived and/or the location of particular targeted regions in relation to each other within the full genome) provides a way to obtain sequence information from regions of the genome that are otherwise poorly mapped or have poor coverage using traditional sequencing techniques.

For example, some genes possess long introns that are too long to span using generally available sequencing techniques, particularly using short-read technologies. Short-read technologies are often preferable sequencing technologies, because they possess superior accuracy as compared to long-read technologies. However, generally used short-read technologies are unable to span across long regions of the genome, and thus information may not be obtainable using these conventional technologies in regions of the genome that are difficult to characterize due to structural characteristics such as long lengths of tandem repeating sequences, high GC content, and genes containing long introns. In the methods and systems described herein, however, the molecular context of targeted regions is retained, generally through the tagging procedure illustrated in FIG. 1 and described in further detail herein. As such, links can be made across extended regions of the genome. For example, as schematically illustrated in FIG. 2B, nucleic acid molecule 207 contains two exons (shaded bars) with a long intronic region (208). In the methods described herein, the individual nucleic acid molecule 207 is distributed into its own discrete partition 211 and then fragmented such that different fragments contain different portions of the exons and the intron. Because each of those fragments is tagged such that any sequence information obtained from the fragments is then attributable to the discrete partition in which it was generated, each fragment is thus also attributable to the individual nucleic acid molecule 207 from which it was derived.

In general, and as is described in further detail herein, after fragmentation and tagging, fragments from different partitions are combined together. Targeted capture methods can then be used to enrich the population of fragments that undergoes further analysis, such as sequencing, with fragments containing the targeted region of interest. In the example illustrated in FIG. 2B, the baits used will enrich the population of fragments to capture only those containing a portion of the exons, but regions outside of the exon and intron (such as 209 and 210) would not be captured. Thus, the final population of fragments that undergoes sequencing will be enriched for the fragments containing the portions of the exons, even if those exons are separated by a long intronic region. Short read, high accuracy sequencing technologies can then be used to identify the sequences of this enriched population of fragments, and because each of the fragments is tagged and thus attributable to its original molecular context, i.e., its original individual nucleic acid molecule, the short read sequences can be pieced together to provide information about the relationship between the exons. In some embodiments, the baits used to capture fragments containing all or part of one or more exons are complementary to one or more portions of the one or more exons themselves. In other embodiments, the baits are complementary to one or more portions of the intervening introns or to sequences adjacent to or near the exon on either the 3′ or 5′ side of the exon regions (such baits are also referred to herein as “intronic baits”). In further embodiments, the baits used to capture the fragments containing all or part of the exon include baits complementary to the exon itself and intronic baits.
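The way partition tags allow short reads from separate exons to be pieced together can be illustrated with a small sketch. It assumes, purely for illustration, that each captured read has already been reduced to a `(barcode, target)` pair, where `target` names the exon the read maps to; the function names are hypothetical. Targets that share a barcode are counted as candidates for having originated from the same molecule. In practice a partition may hold more than one molecule, so shared barcodes are supporting evidence rather than proof.

```python
from collections import defaultdict
from itertools import combinations


def targets_by_barcode(reads):
    """Group captured targets by the barcode (partition tag) of their reads."""
    seen = defaultdict(set)
    for barcode, target in reads:
        seen[barcode].add(target)
    return {bc: sorted(targets) for bc, targets in seen.items()}


def linked_target_pairs(reads):
    """Count, for each pair of targets, how many barcodes contain both."""
    counts = defaultdict(int)
    for targets in targets_by_barcode(reads).values():
        for a, b in combinations(targets, 2):
            counts[(a, b)] += 1
    return dict(counts)


if __name__ == "__main__":
    reads = [
        ("BC1", "exon1"), ("BC1", "exon2"),
        ("BC2", "exon1"),
        ("BC3", "exon2"), ("BC3", "exon1"),
        ("BC1", "exon1"),
    ]
    print(linked_target_pairs(reads))
```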

The ability to retain the molecular context of the targeted regions captured for sequencing also provides the advantage of allowing for sequencing across poorly characterized regions of the genome. As will be appreciated, a significant percentage (at least 5-10% according to, for example, Altemose et al., PLOS Computational Biology, May 15, 2014, Vol. 10, Issue 5) of the human genome remains unassembled, unmapped, and poorly characterized. The reference assembly generally annotates these missing regions as multi-megabase heterochromatic gaps, found primarily near centromeres and on the short arms of the acrocentric chromosomes. This missing fraction of the genome includes structural features that remain resistant to accurate characterization using generally used sequencing technologies. By providing the ability to link information across extended regions of the genome, the methods described herein provide a way to allow for sequencing across these poorly characterized regions.

In some examples, sample preparation methods, including methods of fragmenting, amplifying, partitioning, and otherwise processing genomic DNA, can lead to biases or lower coverage of certain regions of a genome. Such biases or lowered coverage can be compensated for in the methods and systems disclosed herein by altering the concentration of baits used to capture targeted regions of the genome. For example, in some situations it is known that certain regions of the genome will have low coverage after the fragment library is processed, such as regions containing high GC content or other structural variations that lead to bias toward certain areas of the genome over others. In such situations, the library of baits can be altered to increase the concentration of baits directed to those regions of low coverage. In other words, the population of baits used may be “spiked” to ensure that a sufficient number of fragments containing targeted regions of the genome in those low coverage areas are obtained in the final population of fragments to be sequenced. Such spiking of baits may be conducted through design of custom libraries in some embodiments. In further embodiments, the spiking of baits can be conducted in commercially available whole exome kits, such that a custom library of baits directed toward the lower coverage regions is added to off-the-shelf exome capture kits.
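One simple way to think about bait spiking is as a per-region scaling factor derived from observed coverage. The sketch below is an assumption-laden illustration, not a procedure taken from this disclosure: it supposes that recovered coverage scales roughly linearly with bait concentration (a simplification), never reduces concentration below baseline, and caps the boost at a hypothetical `max_factor`.

```python
def spike_factors(observed_coverage, target_coverage, max_factor=10.0):
    """Return a relative bait-concentration multiplier for each region.

    Regions at or above `target_coverage` keep a factor of 1.0; low-coverage
    regions are boosted in proportion to their shortfall, up to `max_factor`.
    """
    factors = {}
    for region, coverage in observed_coverage.items():
        if coverage <= 0:
            factors[region] = max_factor  # no coverage observed: use the cap
        else:
            factors[region] = min(max(target_coverage / coverage, 1.0), max_factor)
    return factors


if __name__ == "__main__":
    observed = {"gc_rich": 10.0, "typical": 40.0, "dropout": 0.0}
    print(spike_factors(observed, target_coverage=30.0))
```

Any real spiking scheme would need to be calibrated empirically, since hybridization efficiency does not generally respond linearly to bait concentration.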

An advantage of the methods and systems described herein is that the targeted regions that are captured are processed prior to capture in such a way that even after the steps of capturing the targeted regions and conducting sequencing analyses, the original molecular context of those targeted regions is retained. As is discussed in further detail herein, the ability to attribute specific targeted regions to their original molecular context (which can include the original chromosome or chromosomal region from which they are derived and/or the location of particular targeted regions in relation to each other within the full genome) provides a way to obtain sequence information from regions of the genome that are otherwise poorly mapped or have poor coverage using traditional sequencing techniques.

For example, some genes possess long introns that are too long to span using generally available sequencing techniques, particularly using short-read technologies that possess superior accuracy as compared to long-read technologies. In the methods and systems described herein, however, the molecular context of targeted regions is retained, generally through the tagging procedure illustrated in FIG. 1 and described in further detail herein. As such, links can be made across extended regions of the genome. For example, as schematically illustrated in FIG. 2B, nucleic acid molecule 207 contains exons (shaded bars) interrupted by a long intronic region. Generally used sequencing technologies would be unable to span the distance across the intron to provide information on the relationship between the two exons. In the methods described herein, the individual nucleic acid molecule 207 is distributed into its own discrete partition 211 and then fragmented such that different fragments contain different portions of the exons and the intron. Because each of those fragments is tagged such that any sequence information obtained from the fragments is then attributable to the discrete partition in which it was generated, each fragment is thus also attributable to the individual nucleic acid molecule 207 from which it was derived. In general, and as is described in further detail herein, after fragmentation and tagging, fragments from different partitions are combined together. Targeted capture methods can then be used to enrich the population of fragments that undergoes further analysis, such as sequencing, with fragments containing the targeted region of interest. In the example illustrated in FIG. 2B, the baits used will enrich the population of fragments to capture only those containing a portion of one of the exons, but regions outside of the exons (such as 209 and 210) would not be captured. Thus, the final population of fragments that undergoes sequencing will be enriched for the fragments containing the exons of interest. Short read, high accuracy sequencing technologies can then be used to identify the sequences of this enriched population of fragments, and because each of the fragments is tagged and thus attributable to its original molecular context, i.e., its original individual nucleic acid molecule, the short read sequences can be pieced together to span across the length of the intervening intron (which can in some examples be on the order of 1, 2, 5, 10 or more kilobases in length) to provide linked sequence information on the two exons.

As noted above, the methods and systems described herein provide individual molecular context for short sequence reads of longer nucleic acids. As used herein, individual molecular context refers to sequence context beyond the specific sequence read, e.g., relation to adjacent or proximal sequences, that are not included within the sequence read itself, and as such, will typically be such that they would not be included in whole or in part in a short sequence read, e.g., a read of about 150 bases, or about 300 bases for paired reads. In particularly preferred aspects, the methods and systems provide long range sequence context for short sequence reads. Such long range context includes relationship or linkage of a given sequence read to sequence reads that are within a distance of each other of longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb, or longer. By providing longer range individual molecular context, the methods and systems of the invention also provide much longer inferred molecular context. Sequence context, as described herein, can include lower resolution context, e.g., from mapping the short sequence reads to the individual longer molecules or contigs of linked molecules, as well as the higher resolution sequence context, e.g., from long range sequencing of large portions of the longer individual molecules, e.g., having contiguous determined sequences of individual molecules where such determined sequences are longer than 1 kb, longer than 5 kb, longer than 10 kb, longer than 15 kb, longer than 20 kb, longer than 30 kb, longer than 40 kb, longer than 50 kb, longer than 60 kb, longer than 70 kb, longer than 80 kb, longer than 90 kb or even longer than 100 kb. As with sequence context, the attribution of short sequences to longer nucleic acids, e.g., both individual long nucleic acid molecules or collections of linked nucleic acid molecules or contigs, may include both mapping of short sequences against longer nucleic acid stretches to provide high level sequence context, as well as providing assembled sequences from the short sequences through these longer nucleic acids.

IV. Samples

As will be appreciated, the methods and systems discussed herein can be used to obtain targeted sequence information from any type of genomic material. Such genomic material may be obtained from a sample taken from a patient. Exemplary samples and types of genomic material of use in the methods and systems discussed herein include without limitation polynucleotides, nucleic acids, oligonucleotides, circulating cell-free nucleic acid, circulating tumor cell (CTC), nucleic acid fragments, nucleotides, DNA, RNA, peptide polynucleotides, complementary DNA (cDNA), double stranded DNA (dsDNA), single stranded DNA (ssDNA), plasmid DNA, cosmid DNA, chromosomal DNA, genomic DNA (gDNA), viral DNA, bacterial DNA, mtDNA (mitochondrial DNA), ribosomal RNA, cell-free DNA, cell free fetal DNA (cffDNA), mRNA, rRNA, tRNA, nRNA, siRNA, snRNA, snoRNA, scaRNA, microRNA, dsRNA, viral RNA, and the like. In summary, the samples that are used may vary depending on the particular processing needs.

Any substance that comprises nucleic acid may be the source of a sample. The substance may be a fluid, e.g., a biological fluid. A fluidic substance may include, but is not limited to, blood, cord blood, saliva, urine, sweat, serum, semen, vaginal fluid, gastric and digestive fluid, spinal fluid, placental fluid, cavity fluid, ocular fluid, breast milk, lymphatic fluid, or combinations thereof. The substance may be solid, for example, a biological tissue. The substance may comprise normal healthy tissues, diseased tissues, or a mix of healthy and diseased tissues. In some cases, the substance may comprise tumors. Tumors may be benign (non-cancer) or malignant (cancer). Non-limiting examples of tumors may include: fibrosarcoma, myxosarcoma, liposarcoma, chondrosarcoma, osteogenic sarcoma, chordoma, angiosarcoma, endotheliosarcoma, lymphangiosarcoma, lymphangioendotheliosarcoma, synovioma, mesothelioma, Ewing's, leiomyosarcoma, rhabdomyosarcoma, gastrointestinal system carcinomas, colon carcinoma, pancreatic cancer, breast cancer, genitourinary system carcinomas, ovarian cancer, prostate cancer, squamous cell carcinoma, basal cell carcinoma, adenocarcinoma, sweat gland carcinoma, sebaceous gland carcinoma, papillary carcinoma, papillary adenocarcinomas, cystadenocarcinoma, medullary carcinoma, bronchogenic carcinoma, renal cell carcinoma, hepatoma, bile duct carcinoma, choriocarcinoma, seminoma, embryonal carcinoma, Wilms' tumor, cervical cancer, endocrine system carcinomas, testicular tumor, lung carcinoma, small cell lung carcinoma, non-small cell lung carcinoma, bladder carcinoma, epithelial carcinoma, glioma, astrocytoma, medulloblastoma, craniopharyngioma, ependymoma, pinealoma, hemangioblastoma, acoustic neuroma, oligodendroglioma, meningioma, melanoma, neuroblastoma, retinoblastoma, or combinations thereof. The substance may be associated with various types of organs. Non-limiting examples of organs may include brain, liver, lung, kidney, prostate, ovary, spleen, lymph node (including tonsil), thyroid, pancreas, heart, skeletal muscle, intestine, larynx, esophagus, stomach, or combinations thereof. In some cases, the substance may comprise a variety of cells, including but not limited to: eukaryotic cells, prokaryotic cells, fungal cells, heart cells, lung cells, kidney cells, liver cells, pancreas cells, reproductive cells, stem cells, induced pluripotent stem cells, gastrointestinal cells, blood cells, cancer cells, bacterial cells, bacterial cells isolated from a human microbiome sample, etc. In some cases, the substance may comprise contents of a cell, such as, for example, the contents of a single cell or the contents of multiple cells. Methods and systems for analyzing individual cells are provided in, e.g., U.S. patent application Ser. No. 14/752,641, filed Jun. 26, 2015, the full disclosure of which is hereby incorporated by reference in its entirety, particularly all teachings related to analyzing nucleic acids from individual cells.

Samples may be obtained from various subjects. A subject may be a living subject or a dead subject. Examples of subjects may include, but are not limited to, humans, mammals, non-human mammals, rodents, amphibians, reptiles, canines, felines, bovines, equines, goats, ovines, hens, avians, mice, rabbits, insects, slugs, microbes, bacteria, parasites, or fish. In some cases, the subject may be a patient who has, is suspected of having, or is at risk of developing a disease or disorder. In some cases, the subject may be a pregnant woman. In some cases, the subject may be a normal healthy pregnant woman. In some cases, the subject may be a pregnant woman who is at risk of carrying a baby with a certain birth defect.

A sample may be obtained from a subject by any means known in the art. For example, a sample may be obtained from a subject through accessing the circulatory system (e.g., intravenously or intra-arterially via a syringe or other apparatus), collecting a secreted biological sample (e.g., saliva, sputum, urine, feces, etc.), surgically (e.g., biopsy) acquiring a biological sample (e.g., intra-operative samples, post-surgical samples, etc.), swabbing (e.g., buccal swab, oropharyngeal swab), or pipetting.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

EXAMPLES

Example 1

Whole Exome Capture and Sequencing: NA12878

Genomic DNA from the NA12878 human cell line was subjected to size based separation of fragments using a Blue Pippin DNA sizing system to recover fragments that were greater than or equal to approximately 10 kb in length. The size selected sample nucleic acids were then copartitioned with barcode beads in aqueous droplets within a fluorinated oil continuous phase using a microfluidic partitioning system (see, e.g., U.S. patent application Ser. No. 14/682,952, filed Apr. 9, 2015, and incorporated herein by reference in its entirety for all purposes), where the aqueous droplets also included the dNTPs, thermostable DNA polymerase and other reagents for carrying out amplification within the droplets, as well as DTT for releasing the barcode oligonucleotides from the beads. This was repeated both for 1 ng of total input DNA and 2 ng of total input DNA. The barcode beads were obtained as a subset of a stock library that represented barcode diversity of over 700,000 different barcode sequences. The barcode containing oligonucleotides included additional sequence components and had the general structure:

Bead-P5-BC-R1-Nmer

where P5 and R1 refer to the Illumina attachment and Read1 primer sequences, respectively, BC denotes the barcode portion of the oligonucleotide, and Nmer denotes a random 10 base N-mer priming sequence used to prime the template nucleic acids. See, e.g., U.S. patent application Ser. No. 14/316,383, filed Jun. 26, 2014, the full disclosure of which is hereby incorporated herein by reference in its entirety for all purposes.
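By way of illustration only, the segment order of the barcode oligonucleotide described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the construct actually used: the P5 and R1 strings are short placeholders rather than the real Illumina sequences, the barcode value and its length are hypothetical, and the function names are introduced here for illustration. Only the segment order and the 10 base random N-mer come from the description.

```python
import random

# Placeholder segment sequences: these are NOT the real Illumina P5 or Read1
# sequences, only stand-ins so the layout can be shown.
P5_PLACEHOLDER = "AAAACCCC"
R1_PLACEHOLDER = "GGGGTTTT"
NMER_LENGTH = 10  # the description specifies a random 10 base N-mer


def build_barcode_oligo(barcode, rng):
    """Return the segments of one oligonucleotide keyed by segment name."""
    nmer = "".join(rng.choice("ACGT") for _ in range(NMER_LENGTH))
    return {"P5": P5_PLACEHOLDER, "BC": barcode, "R1": R1_PLACEHOLDER, "Nmer": nmer}


def oligo_sequence(segments):
    """Concatenate the segments in the left-to-right order P5-BC-R1-Nmer."""
    return "".join(segments[name] for name in ("P5", "BC", "R1", "Nmer"))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
segments = build_barcode_oligo("ACGTACGTACGTAC", rng)  # hypothetical barcode
sequence = oligo_sequence(segments)
```

In this sketch every oligonucleotide built with the same barcode shares the P5, BC and R1 segments and differs only in its random N-mer, which mirrors the statement that BC identifies the barcode while Nmer primes the template.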

Following bead dissolution, the droplets were thermocycled to allow for primer extension of the barcode oligos against the template of the sample nucleic acids within each droplet. This resulted in amplified copy fragments of the sample nucleic acids that included the barcode sequence representative of the originating partition, in addition to the other included sequences set forth above.

After barcode labeling of the copy fragments, the emulsion of droplets including the amplified copy fragments was broken and the additional sequencer-required components, e.g., Read2 primer sequence and P7 attachment sequence, were added to the copy fragments through an additional amplification step, which attached these sequences to the other end of the copy fragments. The barcoded DNA was then subjected to hybrid capture using an Agilent SureSelect Exome capture kit.

The table below provides targeting statistics for the NA12878 genome:

| Sample | Median Insert Size | % Fragments on Target | % Bases on Target |
| --- | --- | --- | --- |
| Version 1.A | 258 | 81% | 51% |
| Version 1.B | 224 | 81% | 55% |
| Version 1.C | 165 | 81% | 63% |

The three different versions listed above represent three different shear lengths for the barcoded fragments before the second adapter attachment step.
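The description does not state how the two on-target percentages were computed. As a rough illustration only, the sketch below assumes that a fragment counts as on target when it overlaps a target interval by at least one base, and that bases on target are the fragment bases falling inside target intervals. The coordinates, the interval convention (half-open, single contig) and the function names are all hypothetical.

```python
def overlap(start, end, targets):
    """Bases of the half-open interval [start, end) that fall inside targets.

    `targets` is a list of non-overlapping half-open (start, end) intervals.
    """
    return sum(
        max(0, min(end, t_end) - max(start, t_start)) for t_start, t_end in targets
    )


def on_target_stats(fragments, targets):
    """Return (fraction of fragments on target, fraction of bases on target)."""
    total_bases = sum(end - start for start, end in fragments)
    covered = [overlap(start, end, targets) for start, end in fragments]
    frac_fragments = sum(1 for c in covered if c > 0) / len(fragments)
    frac_bases = sum(covered) / total_bases
    return frac_fragments, frac_bases


# Toy data on a single contig (hypothetical coordinates).
targets = [(100, 200), (500, 600)]
fragments = [(50, 150), (180, 380), (450, 650), (700, 900), (900, 1000)]
frac_fragments, frac_bases = on_target_stats(fragments, targets)
```

With this toy data three of five fragments touch a target (0.6), but only 170 of 800 fragment bases lie inside one (0.2125), which shows how the two metrics can differ for the same set of fragments.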

Example 2 Whole Exome Capture and Sequencing: NA19701 and NA19661

Genomic DNA from the NA19701 and NA19661 cell lines was prepared according to the methods described above in Example 1. Data, including phasing data, from those two cell lines are provided in the table below:

| Metric | NA19661 | NA19701 |
| --- | --- | --- |
| N50_phase_block | 29,535 | 83,953 |
| N90_phase_block | 8,595 | 25,684 |
| mean_phase_block | 5,968 | 21,128 |
| median_phase_block | 0 | 76.5 |
| longest_phase_block | 209,323 | 504,140 |
| fract_genes_phased | 0.719 | 0.841 |
| fract_genes_completely_phased | 0.679 | 0.778 |
| fract_snps_phased | 0.869 | 0.832 |
| fract_snps_barcode | 0.644 | 0.607 |
| fract_snps_barcode_both_alleles | 0.328 | 0.351 |
| prob_snp_correct_in_gene | 0.906 | 0.927 |
| prob_snp_phased_in_gene | 0.807 | 0.889 |
| snp_short_switch_error | 0.013 | 0.013 |
| snp_long_switch_error | 0.012 | 0.013 |
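The description does not define the N50 and N90 phase block metrics. The sketch below assumes the conventional definition: phase blocks are sorted from longest to shortest, and the N50 (or N90) is the length of the block at which the running total first reaches 50% (or 90%) of the summed block length. The block lengths and the function name are hypothetical and are not taken from the data reported here.

```python
def nx_length(block_lengths, fraction):
    """Length of the block at which the running sum of blocks, taken longest
    first, first reaches `fraction` of the total block length."""
    ordered = sorted(block_lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("block_lengths must be non-empty")


# Hypothetical phase block lengths, in bases.
blocks = [100, 80, 50, 30, 20, 10, 10]
n50 = nx_length(blocks, 0.5)
n90 = nx_length(blocks, 0.9)
```

Under this definition a few long blocks can yield a large N50 even when most blocks are short, so N50 need not track the mean or median block length.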

The present specification provides a complete description of the methodologies, systems and/or structures and uses thereof in example aspects of the presently-described technology. Although various aspects of this technology have been described above with a certain degree of particularity, or with reference to one or more individual aspects, those skilled in the art could make numerous alterations to the disclosed aspects without departing from the spirit or scope of the technology hereof. Since many aspects can be made without departing from the spirit and scope of the presently described technology, the appropriate scope resides in the claims hereinafter appended. Other aspects are therefore contemplated. Furthermore, it should be understood that any operations may be performed in any order, unless explicitly claimed otherwise or a specific order is inherently necessitated by the claim language. It is intended that all matter contained in the above description and shown in the accompanying drawings shall be interpreted as illustrative only of particular aspects and is not limiting to the embodiments shown. Unless otherwise clear from the context or expressly stated, any concentration values provided herein are generally given in terms of admixture values or percentages without regard to any conversion that occurs upon or following addition of the particular component of the mixture. To the extent not already expressly incorporated herein, all published references and patent documents referred to in this disclosure are incorporated herein by reference in their entirety for all purposes. Changes in detail or structure may be made without departing from the basic elements of the present technology as defined in the following claims.

What is claimed:
 1. A nucleic acid sequencing library comprising a plurality of nucleic acid fragments from a genomic sample, wherein: (a) the plurality of nucleic acid fragments includes fragments from two or more disparate portions of the genomic sample; and (b) fragments within the plurality of nucleic acid fragments that are less than 1 kb apart in the genomic sample share a common barcode.
 2. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences separated by at least 10 kb.
 3. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences separated by at least 50 kb.
 4. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences separated by at least 100 kb.
 5. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise exons separated by long intronic regions.
 6. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise two or more individual genes, two or more gene groups, or two or more exons.
 7. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences from the same chromosome.
 8. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences from about every 50 kb along the entire genome.
 9. The nucleic acid sequencing library of claim 1, wherein the two or more disparate portions of the genomic sample comprise sequences from about every 100 kb along the entire genome.
 10. The nucleic acid sequencing library of claim 1, wherein the nucleic acid sequencing library further comprises fragments that hybridize to a bait library comprising capture probes, wherein at least a majority of the capture probes in the bait library are designed to hybridize to informative single nucleotide polymorphisms (SNPs).
 11. The bait library of claim 10, wherein the informative SNPs are located within both exons and introns in the targeted portions of the genomic sample.
 12. The bait library of claim 10, wherein the majority of the probes in the bait library are further designed to hybridize to informative SNPs spaced apart by about 1 kilobase to about 15 kilobases (kb).
 13. The bait library of claim 10, wherein the majority of the probes in the bait library are further designed to hybridize to informative SNPs spaced apart by about 5 kb to about 10 kb.
 14. The bait library of claim 10, wherein a plurality of probes within the library are further designed to meet one or more of the following conditions in any combination: (i) for targeted portions of the genome in which there are no informative SNPs within 10-50 kb of boundaries between exons and introns, the plurality of probes is designed to hybridize at an informative SNP within an intron from those boundaries; (ii) for targeted portions of the genome in which there is a first informative SNP within an exon and that first informative SNP is located 10-50 kb from a boundary with an adjacent intron and a second informative SNP within the adjacent intron and that second informative SNP is located 10-50 kb from the boundary, the plurality of probes is designed to hybridize to a region of the genome between the first and second informative SNPs; (iii) for targeted portions of the genome comprising no informative SNPs for at least 10-50 kb, the plurality of probes is designed to hybridize every 0.5, 1, 3, or 5 kb to those targeted portions of the genome; (iv) for targeted portions of the genome in which there are no informative SNPs within 10-50 kb of boundaries between exons and introns, the plurality of probes is designed to hybridize to the next closest informative SNP to the exon-intron boundaries.
 15. The bait library of claim 10, wherein the library of probes comprises probes designed to hybridize to regions of the genomic sample that flank exons at a density that provides linkage information across barcodes.
 16. The bait library of claim 10, wherein the library of probes is optimized for coverage of the targeted portions of the genomic sample, and wherein the targeted portions of the genomic sample comprise regions of high map quality.